Multi-Modal Classifiers for Open-Vocabulary Object Detection

06/08/2023
by Prannay Kaul, et al.

The goal of this paper is open-vocabulary object detection (OVOD) – building a model that can detect objects beyond the set of categories seen at training, thus enabling the user to specify categories of interest at inference without the need for model retraining. We adopt a standard two-stage object detector architecture and explore three ways of specifying novel categories: via language descriptions, via image exemplars, or via a combination of the two. We make three contributions: first, we prompt a large language model (LLM) to generate informative language descriptions for object classes, and construct powerful text-based classifiers; second, we employ a visual aggregator on image exemplars that can ingest any number of images as input, forming vision-based classifiers; and third, we provide a simple method to fuse information from language descriptions and image exemplars, yielding a multi-modal classifier. When evaluating on the challenging LVIS open-vocabulary benchmark, we demonstrate that: (i) our text-based classifiers outperform all previous OVOD works; (ii) our vision-based classifiers perform as well as the text-based classifiers of prior work; (iii) our multi-modal classifiers perform better than either modality alone; and finally, (iv) our text-based and multi-modal classifiers yield better performance than a fully-supervised detector.
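To make the three-part recipe concrete, below is a minimal sketch (not the authors' released code) of how such classifiers could be assembled and fused. It assumes per-class description and exemplar embeddings come from a frozen vision-language model such as CLIP; the mean pooling, the mixing weight alpha, and all names are illustrative stand-ins, and the paper itself uses a learned visual aggregator rather than a simple mean.

```python
# Hypothetical sketch of open-vocabulary classifier construction and fusion.
# Embeddings are random stand-ins here; in practice they would come from a
# frozen vision-language encoder (e.g., CLIP) applied to LLM-generated class
# descriptions and to image exemplars of the class.
import numpy as np

def l2_normalize(x, axis=-1):
    """Scale vectors to unit length so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def text_classifier(desc_embs):
    """Average several LLM-generated description embeddings for one class."""
    return l2_normalize(desc_embs.mean(axis=0))

def vision_classifier(exemplar_embs):
    """Aggregate any number of image-exemplar embeddings for one class.
    Mean pooling is a stand-in for the paper's learned aggregator."""
    return l2_normalize(exemplar_embs.mean(axis=0))

def multimodal_classifier(w_text, w_vision, alpha=0.5):
    """Fuse the two unit-norm classifiers; alpha is a hypothetical mixing
    weight, not a value taken from the paper."""
    return l2_normalize(alpha * w_text + (1.0 - alpha) * w_vision)

rng = np.random.default_rng(0)
dim = 512                                   # e.g., CLIP ViT-B/32 width
desc_embs = rng.normal(size=(10, dim))      # 10 descriptions for one class
exemplar_embs = rng.normal(size=(5, dim))   # 5 image exemplars of the class

w = multimodal_classifier(text_classifier(desc_embs),
                          vision_classifier(exemplar_embs))
region_feat = l2_normalize(rng.normal(size=dim))  # a detector region embedding
score = region_feat @ w                     # cosine similarity = class score
print(f"class score: {score:.3f}")
```

Because the classifier is just a unit vector, novel categories can be added at inference time by building a new vector from their descriptions or exemplars, with no detector retraining.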

Related research

Multi-modal Queried Object Detection in the Wild (05/30/2023)
We introduce MQ-Det, an efficient architecture and pre-training strategy...

Exploring Multi-Modal Contextual Knowledge for Open-Vocabulary Object Detection (08/30/2023)
In this paper, we for the first time explore helpful multi-modal context...

Localized Vision-Language Matching for Open-vocabulary Object Detection (05/12/2022)
In this work, we propose an open-world object detection method that, bas...

Hypernymization of named entity-rich captions for grounding-based multi-modal pretraining (04/25/2023)
Named entities are ubiquitous in text that naturally accompanies images,...

Diagnosing and Rectifying Vision Models using Language (02/08/2023)
Recent multi-modal contrastive learning models have demonstrated the abi...

Reading Isn't Believing: Adversarial Attacks On Multi-Modal Neurons (03/18/2021)
With OpenAI's publishing of their CLIP model (Contrastive Language-Imag...

DetGPT: Detect What You Need via Reasoning (05/23/2023)
In recent years, the field of computer vision has seen significant advan...