CapDet: Unifying Dense Captioning and Open-World Detection Pretraining

03/04/2023
by   Yanxin Long, et al.

Benefiting from large-scale vision-language pre-training on image-text pairs, open-world detection methods have shown superior generalization ability under zero-shot or few-shot detection settings. However, existing methods still require a pre-defined category space at inference time, and only objects belonging to that space are predicted. To introduce a "real" open-world detector, we propose CapDet, a novel method that can either predict under a given category list or directly generate the category of predicted bounding boxes. Specifically, we unify the open-world detection and dense captioning tasks into a single yet effective framework by introducing an additional dense captioning head that generates region-grounded captions. In addition, adding the captioning task in turn benefits the generalization of detection, since the captioning dataset covers more concepts. Experimental results show that by unifying the dense captioning task, CapDet obtains significant performance improvements (e.g., +2.1 classes) over the baseline method on LVIS (1203 classes). CapDet also achieves state-of-the-art performance on dense captioning tasks, e.g., 15.44


