AAformer: Auto-Aligned Transformer for Person Re-Identification

04/02/2021
by Kuan Zhu, et al.

Transformers are showing their superiority over convolutional architectures in many vision tasks, such as image classification and object detection. However, the lack of an explicit alignment mechanism limits their capability in person re-identification (re-ID), where misalignment caused by pose and viewpoint variations, among other factors, is inevitable. Moreover, in our experiments the alignment paradigm of convolutional neural networks does not perform well in Transformers. To address this problem, we develop a novel alignment framework for Transformers that adds learnable vectors, called "part tokens", to learn part representations and integrates part alignment into self-attention. A part token interacts only with a subset of patch embeddings and learns to represent that subset. Based on this framework, we design an online Auto-Aligned Transformer (AAformer) that adaptively assigns patch embeddings with the same semantics to the same part token at run time. The part tokens can be regarded as part prototypes, and a fast variant of the Sinkhorn-Knopp algorithm is employed to cluster the patch embeddings to the part tokens online. AAformer can be viewed as a new principled formulation for simultaneously learning part alignment and part representations. Extensive experiments validate the effectiveness of part tokens and the superiority of AAformer over various state-of-the-art CNN-based methods. Our code will be released.
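The abstract does not spell out the fast Sinkhorn-Knopp variant, so the following is only a minimal sketch of the generic idea: turning patch-to-part similarity scores into a balanced assignment of patches to part tokens. The function names `sinkhorn_assign` and `hard_assign`, the temperature `eps`, the iteration count `n_iters`, and the equal-mass constraint per part are all assumptions made for illustration, not the authors' implementation.

```python
import math


def sinkhorn_assign(scores, eps=0.05, n_iters=3):
    """Balanced soft assignment of N patches to K parts (illustrative sketch).

    scores: N x K nested list of patch-to-part similarities (hypothetical input,
            e.g. dot products between patch embeddings and part tokens).
    Returns an N x K nested list whose rows each sum to 1.
    """
    n, k = len(scores), len(scores[0])
    # Subtract the global maximum before exponentiating for numerical stability.
    top = max(max(row) for row in scores)
    q = [[math.exp((s - top) / eps) for s in row] for row in scores]
    for _ in range(n_iters):
        # Column step: every part receives the same total mass, 1/K.
        col = [sum(q[i][j] for i in range(n)) for j in range(k)]
        q = [[q[i][j] / (col[j] * k) for j in range(k)] for i in range(n)]
        # Row step: every patch carries the same total mass, 1/N.
        row = [sum(r) for r in q]
        q = [[q[i][j] / (row[i] * n) for j in range(k)] for i in range(n)]
    # Rescale so that each row is a distribution over parts.
    return [[v * n for v in r] for r in q]


def hard_assign(q):
    """Index of the highest-weight part for each patch."""
    return [max(range(len(r)), key=r.__getitem__) for r in q]


if __name__ == "__main__":
    # Four patches, two parts: toy similarity scores chosen for illustration.
    toy_scores = [[0.9, 0.1], [0.8, 0.2], [0.2, 0.7], [0.1, 0.9]]
    assignment = sinkhorn_assign(toy_scores)
    print(hard_assign(assignment))
```

In a setup like the one the abstract describes, the resulting assignment could then decide which patch embeddings each part token interacts with in self-attention; how AAformer actually wires this in is not stated here.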


