End-to-End Object Detection with Adaptive Clustering Transformer

11/18/2020
by Minghang Zheng, et al.

End-to-end Object Detection with Transformer (DETR) performs object detection with a Transformer and achieves performance comparable to two-stage object detectors such as Faster-RCNN. However, DETR needs huge computational resources for training and inference because of its high-resolution spatial input. This paper proposes a novel Transformer variant, the Adaptive Clustering Transformer (ACT), to reduce the computation cost for high-resolution input. ACT clusters the query features adaptively using Locality Sensitive Hashing (LSH) and approximates the query-key interaction with the prototype-key interaction. ACT reduces the quadratic O(N^2) complexity of self-attention to O(NK), where K is the number of prototypes in each layer. ACT can serve as a drop-in module that replaces the original self-attention module without any training. ACT achieves a good balance between accuracy and computation cost (FLOPs). The code is available as supplementary material for ease of experiment replication and verification.
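
To illustrate the idea of replacing the query-key interaction with a prototype-key interaction, here is a minimal NumPy sketch. It is not the authors' implementation. Several choices are assumptions made for illustration: the random-hyperplane sign hash stands in for whatever LSH family the paper uses, each prototype is taken to be the mean of the queries in its cluster, and the function names (`lsh_cluster`, `prototype_attention`) and the `n_bits` parameter are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def full_attention(queries, keys, values):
    # Standard scaled dot-product attention: one score per (query, key) pair.
    d = queries.shape[-1]
    return softmax(queries @ keys.T / np.sqrt(d)) @ values


def lsh_cluster(queries, n_bits=8, seed=0):
    # Assumption: random-hyperplane (sign) hashing. Queries that receive
    # the same bit code are placed in the same cluster.
    rng = np.random.default_rng(seed)
    planes = rng.standard_normal((queries.shape[1], n_bits))
    bits = (queries @ planes > 0).astype(np.int64)
    codes = bits @ (1 << np.arange(n_bits))
    _, labels = np.unique(codes, return_inverse=True)
    return labels.reshape(-1)


def prototype_attention(queries, keys, values, n_bits=8, seed=0):
    labels = lsh_cluster(queries, n_bits, seed)
    n_clusters = int(labels.max()) + 1

    # Assumption: a prototype is the mean of the queries in its cluster.
    sums = np.zeros((n_clusters, queries.shape[1]))
    np.add.at(sums, labels, queries)
    counts = np.bincount(labels, minlength=n_clusters)[:, None]
    prototypes = sums / counts

    # Attention is computed once per prototype, then the result is
    # broadcast back to every query in that prototype's cluster.
    proto_out = full_attention(prototypes, keys, values)
    return proto_out[labels], n_clusters
```

In this sketch the score matrix has one row per prototype rather than one row per query, so its size is K x N instead of N x N, which is where the O(NK) figure in the abstract comes from. The number of clusters is not fixed in advance: it depends on how many distinct hash codes the queries produce, so fewer hash bits give fewer prototypes and a coarser approximation.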


research
10/29/2020

RelationNet++: Bridging Visual Representations for Object Detection via Transformer Decoder

Existing object detection frameworks are usually built on a single forma...
research
06/06/2021

Oriented Object Detection with Transformer

Object detection with Transformers (DETR) has achieved a competitive per...
research
10/08/2022

Towards Light Weight Object Detection System

Transformers are a popular choice for classification tasks and as backbo...
research
12/13/2022

CNN-transformer mixed model for object detection

Object detection, one of the three main tasks of computer vision, has be...
research
08/18/2023

Transformer-based Detection of Microorganisms on High-Resolution Petri Dish Images

Many medical or pharmaceutical processes have strict guidelines regardin...
research
05/29/2022

EfficientViT: Enhanced Linear Attention for High-Resolution Low-Computation Visual Recognition

Vision Transformer (ViT) has achieved remarkable performance in many vis...
research
11/29/2021

Sparse DETR: Efficient End-to-End Object Detection with Learnable Sparsity

DETR is the first end-to-end object detector using a transformer encoder...
