UP-DETR: Unsupervised Pre-training for Object Detection with Transformers

11/18/2020
by Zhigang Dai, et al.

Object detection with transformers (DETR) reaches competitive performance with Faster R-CNN via a transformer encoder-decoder architecture. Inspired by the great success of pre-training transformers in natural language processing, we propose a pretext task named random query patch detection to unsupervisedly pre-train DETR (UP-DETR) for object detection. Specifically, we randomly crop patches from the given image and feed them as queries to the decoder. The model is pre-trained to detect these query patches in the original image. During pre-training, we address two critical issues: multi-task learning and multi-query localization. (1) To balance the multi-task learning of classification and localization in the pretext task, we freeze the CNN backbone and propose a patch feature reconstruction branch that is jointly optimized with patch detection. (2) To perform multi-query localization, we first introduce UP-DETR with a single query patch and then extend it to multi-query patches with object query shuffle and an attention mask. In our experiments, UP-DETR significantly boosts the performance of DETR, with faster convergence and higher precision on the PASCAL VOC and COCO datasets. The code will be available soon.
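
To make the pretext task concrete, below is a minimal PyTorch sketch of random query patch detection: sampling random crops to serve as queries and combining the three pre-training objectives. All names, sizes, and loss weights here are illustrative assumptions rather than the authors' released code, and for brevity it pairs one query with one patch in a fixed order, whereas the paper matches predictions to patches, groups queries, and separates the groups with query shuffle and an attention mask.

import torch
import torch.nn.functional as F

def sample_random_patches(image, num_patches=4, out_size=64,
                          min_frac=0.1, max_frac=0.5):
    # Crop random rectangles from a CHW image; return the resized crops
    # plus their ground-truth boxes in normalized (cx, cy, w, h) format.
    _, H, W = image.shape
    crops, boxes = [], []
    for _ in range(num_patches):
        w = int(W * torch.empty(1).uniform_(min_frac, max_frac))
        h = int(H * torch.empty(1).uniform_(min_frac, max_frac))
        x = torch.randint(0, W - w + 1, (1,)).item()
        y = torch.randint(0, H - h + 1, (1,)).item()
        crop = image[:, y:y + h, x:x + w].unsqueeze(0)
        crops.append(F.interpolate(crop, size=(out_size, out_size),
                                   mode="bilinear", align_corners=False)[0])
        boxes.append(torch.tensor([(x + w / 2) / W, (y + h / 2) / H,
                                   w / W, h / H]))
    return torch.stack(crops), torch.stack(boxes)

def pretext_losses(pred_logits, pred_boxes, pred_feats, gt_boxes, patch_feats):
    # Combine the three pre-training objectives: patch-vs-background
    # classification, box regression, and reconstruction of the frozen
    # CNN patch features (the branch that preserves classification
    # ability while the transformer learns localization).
    targets = torch.zeros(len(gt_boxes), dtype=torch.long)  # class 0 = "patch"
    cls_loss = F.cross_entropy(pred_logits, targets)
    box_loss = F.l1_loss(pred_boxes, gt_boxes)
    recon_loss = F.mse_loss(F.normalize(pred_feats, dim=-1),
                            F.normalize(patch_feats, dim=-1))
    return cls_loss + box_loss + recon_loss

# Toy usage: random tensors stand in for a real image and DETR outputs.
img = torch.rand(3, 256, 256)
crops, gt_boxes = sample_random_patches(img)
P, D = len(gt_boxes), 256
loss = pretext_losses(torch.randn(P, 2), torch.rand(P, 4),
                      torch.randn(P, D), gt_boxes, torch.randn(P, D))
print(float(loss))

In the full method, the crops are encoded by the same frozen CNN backbone, and their pooled features are added to the object queries before decoding, so each query knows which patch it should localize.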

