Three ways to improve feature alignment for open vocabulary detection

03/23/2023
by Relja Arandjelović et al.

The core problem in zero-shot open vocabulary detection is how to align visual and text features so that the detector performs well on unseen classes. Previous approaches train the feature pyramid and detection head from scratch, which breaks the vision-text feature alignment established during pretraining and struggles to prevent the language model from forgetting unseen classes. We propose three methods to alleviate these issues. First, a simple scheme augments the text embeddings, preventing overfitting to the small number of classes seen during training while simultaneously saving memory and computation. Second, the feature pyramid network and the detection head are modified to include trainable gated shortcuts, which encourage vision-text feature alignment and guarantee it at the start of detection training. Finally, a self-training approach leverages a larger corpus of image-text pairs, improving detection performance on classes with no human-annotated bounding boxes. All three methods are evaluated on the zero-shot version of the LVIS benchmark, and each shows clear and significant benefits. Our final network achieves the new state-of-the-art on the mAP-all metric, demonstrates competitive performance on mAP-rare, and transfers better to COCO and Objects365.
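
To make the gated-shortcut idea concrete, here is a minimal PyTorch sketch. The module name, the scalar tanh gate, and its zero initialisation are illustrative assumptions rather than the paper's exact formulation; the point is that a zero-initialised gate makes the wrapped block a no-op at initialisation, so the detector starts from the pretrained, already-aligned features and only gradually learns to deviate from them.

```python
import torch
import torch.nn as nn

class GatedShortcut(nn.Module):
    """Residual wrapper with a learnable gate initialised to zero, so the
    wrapped block contributes nothing at the start of training and the
    pretrained vision-text feature alignment is preserved."""

    def __init__(self, block: nn.Module):
        super().__init__()
        self.block = block
        # Scalar gate; tanh(0) == 0, so forward() is the identity at init.
        self.gate = nn.Parameter(torch.zeros(()))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + torch.tanh(self.gate) * self.block(x)

# Example: wrap an FPN-style 3x3 conv so the pyramid starts as a pass-through.
fpn_conv = GatedShortcut(nn.Conv2d(256, 256, kernel_size=3, padding=1))
```

The text-embedding augmentation can be sketched in the same spirit. One plausible instantiation, not necessarily the paper's scheme, is to embed each class with a single randomly sampled prompt per training step rather than a precomputed prompt ensemble; the PROMPT_TEMPLATES list and the encode_text helper below are hypothetical stand-ins for a CLIP-style text encoder.

```python
import random
import torch
import torch.nn.functional as F

# Hypothetical template list; a real system would use a larger prompt set.
PROMPT_TEMPLATES = [
    "a photo of a {}.",
    "a close-up photo of a {}.",
    "a photo of a small {}.",
]

def sample_class_embeddings(class_names, encode_text):
    """Embed each class with one randomly sampled prompt per training step,
    instead of averaging a full prompt ensemble. The varying text targets
    act as augmentation against overfitting to the seen classes, and only
    one prompt per class is encoded, saving memory and computation."""
    prompts = [random.choice(PROMPT_TEMPLATES).format(n) for n in class_names]
    embeddings = encode_text(prompts)        # (num_classes, dim)
    return F.normalize(embeddings, dim=-1)   # unit-norm, CLIP-style
```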

Related research

ZSD-YOLO: Zero-Shot YOLO Detection using Vision-Language Knowledge Distillation (09/24/2021)
Real-world object sampling produces long-tailed distributions requiring ...

Contrastive Feature Masking Open-Vocabulary Vision Transformer (09/02/2023)
We present Contrastive Feature Masking Vision Transformer (CFM-ViT) - an...

Region-Aware Pretraining for Open-Vocabulary Object Detection with Vision Transformers (05/11/2023)
We present Region-aware Open-vocabulary Vision Transformers (RO-ViT) - a...

P^3OVD: Fine-grained Visual-Text Prompt-Driven Self-Training for Open-Vocabulary Object Detection (11/02/2022)
Inspired by the success of visual-language methods (VLMs) in zero-shot c...

Don't Even Look Once: Synthesizing Features for Zero-Shot Detection (11/18/2019)
Zero-shot detection, namely, localizing both seen and unseen objects, in...

Augmenting Zero-Shot Detection Training with Image Labels (06/12/2023)
Zero-shot detection (ZSD), i.e., detection on classes not seen during tr...

CapDet: Unifying Dense Captioning and Open-World Detection Pretraining (03/04/2023)
Benefiting from large-scale vision-language pre-training on image-text p...
