PatchFormer: A Versatile 3D Transformer Based on Patch Attention

10/30/2021
by Zhang Cheng, et al.

The 3D vision community is witnessing a modeling shift from CNNs to Transformers, where pure Transformer architectures have attained top accuracy on the major 3D learning benchmarks. However, existing 3D Transformers need to generate a large attention map, which has quadratic complexity (in both space and time) with respect to the input size. To address this shortcoming, we introduce patch-attention, which adaptively learns a much smaller set of bases upon which the attention maps are computed. Through a weighted summation over these bases, patch-attention not only captures the global shape context but also achieves linear complexity with respect to the input size. In addition, we propose a lightweight Multi-Scale Attention (MSA) block to build attention among features of different scales, providing the model with multi-scale features. Based on these modules, we construct our neural architecture, called PatchFormer. Extensive experiments demonstrate that our network achieves strong accuracy on general 3D recognition tasks with a 7.3x speed-up over previous 3D Transformers.
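
The abstract does not spell out how the bases are constructed, but the complexity claim can be illustrated with a short sketch. Below is a minimal, hypothetical PyTorch version of linear-complexity attention over M learned bases (M << N points): the `base_weights` projection and all module and parameter names are assumptions for illustration, not the paper's actual formulation. Attending N points to M bases costs O(N·M) rather than the O(N²) of a full attention map.

```python
import torch
import torch.nn as nn


class PatchAttentionSketch(nn.Module):
    """Illustrative linear-complexity attention over a small set of
    learned bases. The base construction here (a learned softmax
    projection of the input features) is a hypothetical stand-in;
    the paper's exact patch-attention design is not given in the
    abstract."""

    def __init__(self, dim: int, num_bases: int = 32):
        super().__init__()
        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(dim, dim)
        self.to_v = nn.Linear(dim, dim)
        # Adaptive weights that summarize N point features into M bases.
        self.base_weights = nn.Linear(dim, num_bases)
        self.scale = dim ** -0.5

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, N, C) point features
        q = self.to_q(x)  # (B, N, C)
        k = self.to_k(x)  # (B, N, C)
        v = self.to_v(x)  # (B, N, C)
        # Each of the M bases is a weighted summation over the N points.
        w = self.base_weights(x).softmax(dim=1)        # (B, N, M)
        bases_k = torch.einsum('bnm,bnc->bmc', w, k)   # (B, M, C)
        bases_v = torch.einsum('bnm,bnc->bmc', w, v)   # (B, M, C)
        # Attention between N queries and M bases: O(N*M), linear in N.
        attn = (q @ bases_k.transpose(1, 2)) * self.scale  # (B, N, M)
        attn = attn.softmax(dim=-1)
        return attn @ bases_v                              # (B, N, C)


if __name__ == "__main__":
    x = torch.randn(2, 1024, 64)  # 2 point clouds, 1024 points, 64-dim features
    out = PatchAttentionSketch(dim=64, num_bases=32)(x)
    print(out.shape)              # torch.Size([2, 1024, 64])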

