Towards Efficient Use of Multi-Scale Features in Transformer-Based Object Detectors

08/24/2022
by   Gongjie Zhang, et al.

Multi-scale features have been proven highly effective for object detection, and most ConvNet-based object detectors adopt Feature Pyramid Network (FPN) as a basic component for exploiting multi-scale features. However, for the recently proposed Transformer-based object detectors, directly incorporating multi-scale features leads to prohibitive computational overhead due to the high complexity of the attention mechanism for processing high-resolution features. This paper presents Iterative Multi-scale Feature Aggregation (IMFA) – a generic paradigm that enables the efficient use of multi-scale features in Transformer-based object detectors. The core idea is to exploit sparse multi-scale features from just a few crucial locations, which is achieved with two novel designs. First, IMFA rearranges the Transformer encoder-decoder pipeline so that the encoded features can be iteratively updated based on the detection predictions. Second, IMFA sparsely samples scale-adaptive features for refined detection from just a few keypoint locations under the guidance of prior detection predictions. As a result, the sampled multi-scale features are sparse yet still highly beneficial for object detection. Extensive experiments show that the proposed IMFA significantly boosts the performance of multiple Transformer-based object detectors with only slight computational overhead. Project page: https://github.com/ZhangGongjie/IMFA.
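To make the core idea concrete, the following is a minimal, illustrative sketch (not the paper's actual implementation) of sampling sparse, multi-scale features at a handful of keypoint locations: given a feature pyramid and a few normalized points (e.g. predicted box centers), features are bilinearly interpolated at each point from every pyramid level and concatenated. All function names and the toy pyramid are assumptions made for illustration.

```python
import numpy as np

def bilinear_sample(feature_map, x, y):
    """Bilinearly interpolate a (C, H, W) feature map at fractional (x, y)."""
    C, H, W = feature_map.shape
    x0, y0 = max(int(np.floor(x)), 0), max(int(np.floor(y)), 0)
    x1, y1 = min(x0 + 1, W - 1), min(y0 + 1, H - 1)
    dx, dy = x - x0, y - y0
    top = (1 - dx) * feature_map[:, y0, x0] + dx * feature_map[:, y0, x1]
    bot = (1 - dx) * feature_map[:, y1, x0] + dx * feature_map[:, y1, x1]
    return (1 - dy) * top + dy * bot

def sample_sparse_multiscale(pyramid, keypoints):
    """For each normalized keypoint (x, y) in [0, 1], gather features from
    every pyramid level and concatenate them along the channel axis.
    Only these few locations are touched, not the full high-res maps."""
    feats = []
    for kx, ky in keypoints:
        per_level = []
        for fmap in pyramid:
            C, H, W = fmap.shape
            per_level.append(bilinear_sample(fmap, kx * (W - 1), ky * (H - 1)))
        feats.append(np.concatenate(per_level))
    return np.stack(feats)  # shape: (num_keypoints, C * num_levels)

# Toy pyramid: three scales of an 8-channel feature map (hypothetical sizes)
rng = np.random.default_rng(0)
pyramid = [rng.standard_normal((8, s, s)) for s in (64, 32, 16)]
keypoints = [(0.25, 0.5), (0.7, 0.3)]  # e.g. centers from prior predictions
sampled = sample_sparse_multiscale(pyramid, keypoints)
print(sampled.shape)  # (2, 24)
```

The cost of this step scales with the number of sampled keypoints, not with the resolution of the feature maps, which is why sparse sampling sidesteps the quadratic attention cost on high-resolution features.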
