Region-Aware Pretraining for Open-Vocabulary Object Detection with Vision Transformers

05/11/2023
by   Dahun Kim, et al.
0

We present Region-aware Open-vocabulary Vision Transformers (RO-ViT) - a contrastive image-text pretraining recipe to bridge the gap between image-level pretraining and open-vocabulary object detection. At the pretraining phase, we propose to randomly crop and resize regions of positional embeddings instead of using the whole image positional embeddings. This better matches the use of positional embeddings at region-level in the detection finetuning phase. In addition, we replace the common softmax cross entropy loss in contrastive learning with focal loss to better learn the informative yet difficult examples. Finally, we leverage recent advances in novel object proposals to improve open-vocabulary detection finetuning. We evaluate our full model on the LVIS and COCO open-vocabulary detection benchmarks and zero-shot transfer. RO-ViT achieves a state-of-the-art 32.1 AP_r on LVIS, surpassing the best existing approach by +5.8 points in addition to competitive zero-shot transfer detection. Surprisingly, RO-ViT improves the image-level representation as well and achieves the state of the art on 9 out of 12 metrics on COCO and Flickr image-text retrieval benchmarks, outperforming competitive approaches with larger models.

READ FULL TEXT

page 4

page 8

page 13

page 14

research
09/02/2023

Contrastive Feature Masking Open-Vocabulary Vision Transformer

We present Contrastive Feature Masking Vision Transformer (CFM-ViT) - an...
research
12/16/2021

RegionCLIP: Region-based Language-Image Pretraining

Contrastive language-image pretraining (CLIP) using image-text pairs has...
research
07/07/2022

Bridging the Gap between Object and Image-level Representations for Open-Vocabulary Detection

Existing open-vocabulary object detectors typically enlarge their vocabu...
research
03/25/2023

Prompt-Guided Transformers for End-to-End Open-Vocabulary Object Detection

Prompt-OVD is an efficient and effective framework for open-vocabulary o...
research
03/23/2023

Three ways to improve feature alignment for open vocabulary detection

The core problem in zero-shot open vocabulary detection is how to align ...
research
03/17/2023

Enhancing the Role of Context in Region-Word Alignment for Object Detection

Vision-language pretraining to learn a fine-grained, region-word alignme...
research
08/04/2023

Convolutions Die Hard: Open-Vocabulary Segmentation with Single Frozen Convolutional CLIP

Open-vocabulary segmentation is a challenging task requiring segmenting ...

Please sign up or login with your details

Forgot password? Click here to reset