Efficient Decoder-free Object Detection with Transformers

by Peixian Chen, et al.

Vision transformers (ViTs) are changing the landscape of object detection approaches. A natural usage of ViTs in detection is to replace the CNN-based backbone with a transformer-based backbone, which is straightforward and effective, at the price of a considerable computational burden for inference. A more subtle usage is the DETR family, which eliminates the need for many hand-designed components in object detection but introduces a decoder that takes an extra-long time to converge. As a result, transformer-based object detection cannot prevail in large-scale applications. To overcome these issues, we propose a novel decoder-free fully transformer-based (DFFT) object detector that, for the first time, achieves high efficiency in both the training and inference stages. We simplify object detection into an encoder-only, single-level, anchor-based dense prediction problem by centering around two entry points: 1) eliminate the training-inefficient decoder and leverage two strong encoders to preserve the accuracy of single-level feature map prediction; 2) explore low-level semantic features for the detection task with limited computational resources. In particular, we design a novel lightweight detection-oriented transformer backbone that efficiently captures low-level features with rich semantics, based on a well-conceived ablation study. Extensive experiments on the MS COCO benchmark demonstrate that DFFT_SMALL outperforms DETR by 2.5 10x fewer training epochs. Compared with the cutting-edge anchor-based detector RetinaNet, DFFT_SMALL obtains over 5.5 computation cost.
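To make "single-level anchor-based dense prediction" concrete, the following is a minimal sketch of how anchors might be laid out over one feature map, so that every cell predicts against a fixed set of reference boxes. It is not the DFFT implementation: the function name `single_level_anchors`, the stride of 32, the anchor size of 64, and the three aspect ratios are all assumptions chosen for illustration.

```python
from itertools import product


def single_level_anchors(feat_h, feat_w, stride, sizes, ratios):
    """Return (x1, y1, x2, y2) anchors for every cell of ONE feature map.

    Hypothetical helper for illustration; all parameter values are assumptions.
    """
    anchors = []
    for row, col in product(range(feat_h), range(feat_w)):
        # Center of this feature-map cell, in input-image pixels.
        cx = (col + 0.5) * stride
        cy = (row + 0.5) * stride
        for size, ratio in product(sizes, ratios):
            # ratio = height / width; the anchor area stays size * size.
            w = size / ratio ** 0.5
            h = size * ratio ** 0.5
            anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors


# Assumed toy setup: a 2x3 feature map at stride 32, one size, three ratios.
anchors = single_level_anchors(
    feat_h=2, feat_w=3, stride=32, sizes=[64], ratios=[0.5, 1.0, 2.0]
)
print(len(anchors))  # one anchor per (cell, size, ratio) combination
```

A multi-level detector would repeat this for several feature maps at different strides; the single-level formulation uses only one such map, which is why the abstract stresses strong encoders to preserve accuracy.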




