Visualizing and Understanding Patch Interactions in Vision Transformer

by Jie Ma et al.

Vision Transformer (ViT) has become a leading tool in various computer vision tasks, owing to its self-attention mechanism, which learns visual representations explicitly through cross-patch information interactions. Despite this success, the literature seldom explores the explainability of the vision transformer, and there is no clear picture of how attention across the full set of patches affects performance, or what further potential it holds. In this work, we propose a novel explainable visualization approach to analyze and interpret the crucial attention interactions among patches in the vision transformer. Specifically, we first introduce a quantification indicator to measure the impact of patch interaction, and verify this quantification on attention window design and the removal of indiscriminative patches. We then exploit the effective responsive field of each patch in ViT and devise a window-free transformer architecture accordingly. Extensive experiments on ImageNet demonstrate that the proposed quantitative method facilitates ViT model learning, improving top-1 accuracy by 4.28%. Moreover, results on downstream fine-grained recognition tasks further validate the generalization of our proposal.
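The abstract's exact formula for the quantification indicator is not given above, but the idea of scoring cross-patch interactions from attention maps can be sketched as follows. This is a minimal illustrative proxy, assuming head-averaged, symmetrized attention as the interaction measure and a relative threshold to define each patch's "responsive field"; the function name, threshold, and shapes are assumptions, not the paper's method.

```python
import numpy as np

def patch_responsive_field(attn, threshold=0.1):
    """Derive a per-patch responsive-field mask from ViT attention.

    attn: array of shape (heads, N, N) holding one image's attention
    weights (rows sum to 1). Returns an (N, N) boolean mask where
    mask[i, j] is True if patch j interacts strongly with patch i.
    Simplified proxy for the paper's indicator; threshold is arbitrary.
    """
    # Average attention over heads -> (N, N) interaction matrix.
    mean_attn = attn.mean(axis=0)
    # Symmetrize so interaction counts attention in either direction.
    interaction = 0.5 * (mean_attn + mean_attn.T)
    # Normalize each row by its maximum and keep strong interactions.
    rel = interaction / interaction.max(axis=1, keepdims=True)
    return rel >= threshold

# Toy example: random attention over 16 patches with 4 heads.
rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 16, 16))
attn = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)
field = patch_responsive_field(attn)
print(field.shape)  # (16, 16) boolean mask
```

A mask like this could then inform window design (keep patches inside the responsive field) or patch pruning (drop patches that appear in few responsive fields), in the spirit of the experiments the abstract describes.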




Related papers

- CAT: Cross Attention in Vision Transformer
- Graph Reasoning Transformer for Image Parsing
- Pattern Attention Transformer with Doughnut Kernel
- Sequence and Circle: Exploring the Relationship Between Patches
- R-Cut: Enhancing Explainability in Vision Transformers with Relationship Weighted Out and Cut
- UIA-ViT: Unsupervised Inconsistency-Aware Method based on Vision Transformer for Face Forgery Detection
- Salient Mask-Guided Vision Transformer for Fine-Grained Classification
