DetCLIPv2: Scalable Open-Vocabulary Object Detection Pre-training via Word-Region Alignment

04/10/2023
by   Lewei Yao, et al.

This paper presents DetCLIPv2, an efficient and scalable training framework that incorporates large-scale image-text pairs to achieve open-vocabulary object detection (OVD). Unlike previous OVD frameworks that typically rely on a pre-trained vision-language model (e.g., CLIP) or exploit image-text pairs via a pseudo-labeling process, DetCLIPv2 directly learns fine-grained word-region alignment from massive image-text pairs in an end-to-end manner. To accomplish this, we employ a maximum word-region similarity between region proposals and textual words to guide the contrastive objective. To enable the model to gain localization capability while learning broad concepts, DetCLIPv2 is trained with hybrid supervision from detection, grounding and image-text pair data under a unified data formulation. By jointly training with an alternating scheme and adopting low-resolution input for image-text pairs, DetCLIPv2 exploits image-text pair data efficiently and effectively: DetCLIPv2 utilizes 13x more image-text pairs than DetCLIP with a similar training time and improves performance. With 13M image-text pairs for pre-training, DetCLIPv2 demonstrates superior open-vocabulary detection performance, e.g., DetCLIPv2 with a Swin-T backbone achieves 40.4% zero-shot AP on the LVIS benchmark, outperforming previous works GLIP/GLIPv2/DetCLIP by 14.4/11.4/4.5% AP respectively, and even beats its fully-supervised counterpart by a large margin.
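To make the "maximum word-region similarity" objective described above concrete, here is a minimal PyTorch-style sketch that scores an image-text pair by matching each word to its best region and feeds that score into a symmetric contrastive (InfoNCE) loss over the batch. The function names, the max-over-regions / mean-over-words pooling, and the temperature value are illustrative assumptions for this sketch, not the paper's exact formulation.

import torch
import torch.nn.functional as F

def image_text_score(region_feats, word_feats):
    # region_feats: (R, D) region-proposal embeddings for one image
    # word_feats:   (W, D) word embeddings for one caption
    # Normalize so dot products become cosine similarities.
    r = F.normalize(region_feats, dim=-1)
    w = F.normalize(word_feats, dim=-1)
    sim = r @ w.t()                       # (R, W) region-word similarities
    # For each word, keep its best-matching region, then average over words.
    return sim.max(dim=0).values.mean()

def alignment_loss(regions, words, temperature=0.07):
    # regions: list of B tensors of shape (R_i, D)
    # words:   list of B tensors of shape (W_j, D)
    B = len(regions)
    scores = torch.stack([
        torch.stack([image_text_score(regions[i], words[j]) for j in range(B)])
        for i in range(B)
    ])                                    # (B, B) image-text score matrix
    targets = torch.arange(B)
    loss_i2t = F.cross_entropy(scores / temperature, targets)
    loss_t2i = F.cross_entropy(scores.t() / temperature, targets)
    return 0.5 * (loss_i2t + loss_t2i)

In this hypothetical setup, matching image-caption pairs along the diagonal of the score matrix are pulled together while mismatched pairs are pushed apart, which is what lets word-region alignment emerge without box-level pseudo labels.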


research
11/27/2022

Learning Object-Language Alignments for Open-Vocabulary Object Detection

Existing object detection methods are bounded in a fixed-set vocabulary ...
research
09/20/2022

DetCLIP: Dictionary-Enriched Visual-Concept Paralleled Pre-training for Open-world Detection

Open-world object detection, as a more general and challenging goal, aim...
research
03/17/2023

Enhancing the Role of Context in Region-Word Alignment for Object Detection

Vision-language pretraining to learn a fine-grained, region-word alignme...
research
06/16/2023

Scaling Open-Vocabulary Object Detection

Open-vocabulary object detection has benefited greatly from pretrained v...
research
11/02/2022

P^3OVD: Fine-grained Visual-Text Prompt-Driven Self-Training for Open-Vocabulary Object Detection

Inspired by the success of visual-language methods (VLMs) in zero-shot c...
research
08/07/2023

COPA: Efficient Vision-Language Pre-training Through Collaborative Object- and Patch-Text Alignment

Vision-Language Pre-training (VLP) methods based on object detection enj...
research
07/12/2022

IDEA: Increasing Text Diversity via Online Multi-Label Recognition for Vision-Language Pre-training

Vision-Language Pre-training (VLP) with large-scale image-text pairs has...
