X-DETR: A Versatile Architecture for Instance-wise Vision-Language Tasks

04/12/2022
by Zhaowei Cai, et al.

In this paper, we study challenging instance-wise vision-language tasks, in which free-form language must be aligned with individual objects rather than with the whole image. To address these tasks, we propose X-DETR, whose architecture has three major components: an object detector, a language encoder, and a vision-language alignment module. The vision and language streams remain independent until the end, where they are aligned using an efficient dot-product operation. The whole network is trained end-to-end, so that the detector is optimized for the vision-language tasks rather than used as an off-the-shelf component. To overcome the limited size of paired object-language annotations, we leverage other, weaker types of supervision to expand the knowledge coverage. This simple yet effective architecture achieves good accuracy and fast speed on multiple instance-wise vision-language tasks, e.g., 16.4 AP on LVIS detection of 1.2K categories at ~20 frames per second without using any LVIS annotation during training.
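
To make the dot-product alignment step concrete, here is a minimal sketch in plain Python. It is not the authors' implementation. It assumes each detected object and each text query has already been encoded into a fixed-length embedding by its own stream. The function names (`l2_normalize`, `align`, `best_match`), the toy embeddings, and the L2 normalization are all illustrative assumptions not stated in the abstract.

```python
import math

def l2_normalize(vec):
    # Scale a vector to unit length (an assumption; the abstract only
    # mentions a dot product, not normalization).
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def align(object_embeddings, text_embeddings):
    # Score every (object, text) pair with a dot product.
    # Returns a matrix with one row per object and one column per text query.
    objs = [l2_normalize(v) for v in object_embeddings]
    txts = [l2_normalize(v) for v in text_embeddings]
    return [[sum(a * b for a, b in zip(o, t)) for t in txts] for o in objs]

def best_match(scores):
    # For each object, return the index of the highest-scoring text query.
    return [max(range(len(row)), key=row.__getitem__) for row in scores]

# Hypothetical toy data: 2 object embeddings and 3 text embeddings (dimension 3).
object_embeddings = [[0.9, 0.1, 0.0], [0.0, 0.2, 0.8]]
text_embeddings = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0], [0.0, 1.0, 0.0]]

scores = align(object_embeddings, text_embeddings)
matches = best_match(scores)
print(matches)
```

Because the two streams interact only through this final scoring step, text embeddings for a fixed vocabulary could in principle be computed once and reused across images, which is one plausible reason such a design can be fast.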

research
05/30/2023

Multi-modal Queried Object Detection in the Wild

We introduce MQ-Det, an efficient architecture and pre-training strategy...
research
12/01/2021

Human-Object Interaction Detection via Weak Supervision

The goal of this paper is Human-object Interaction (HO-I) detection. HO-...
research
08/21/2021

Grid-VLP: Revisiting Grid Features for Vision-Language Pre-training

Existing approaches to vision-language pre-training (VLP) heavily rely o...
research
03/09/2020

iFAN: Image-Instance Full Alignment Networks for Adaptive Object Detection

Training an object detector on a data-rich domain and applying it to a d...
research
06/16/2023

Scaling Open-Vocabulary Object Detection

Open-vocabulary object detection has benefited greatly from pretrained v...
research
03/12/2023

Towards Universal Vision-language Omni-supervised Segmentation

Existing open-world universal segmentation approaches usually leverage C...
research
08/31/2021

End-to-End Monocular Vanishing Point Detection Exploiting Lane Annotations

Vanishing points (VPs) play a vital role in various computer vision task...
