What do Vision Transformers Learn? A Visual Exploration

12/13/2022
by Amin Ghiasi, et al.

Vision transformers (ViTs) are quickly becoming the de facto architecture for computer vision, yet we understand very little about why they work and what they learn. While existing studies visually analyze the mechanisms of convolutional neural networks (CNNs), an analogous exploration of ViTs remains challenging. In this paper, we first address the obstacles to performing visualizations on ViTs. Assisted by these solutions, we observe that neurons in ViTs trained with language model supervision (e.g., CLIP) are activated by semantic concepts rather than visual features. We also explore the underlying differences between ViTs and CNNs, and we find that transformers detect image background features, just like their convolutional counterparts, but that their predictions depend far less on high-frequency information. At the same time, both architecture types behave similarly in the way features progress from abstract patterns in early layers to concrete objects in late layers. In addition, we show that ViTs maintain spatial information in all layers except the final layer. In contrast to previous works, we show that the last layer most likely discards spatial information and behaves as a learned global pooling operation. Finally, we conduct large-scale visualizations on a wide range of ViT variants, including DeiT, CoaT, ConViT, PiT, Swin, and Twin, to validate the effectiveness of our method.
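
The visualizations the abstract refers to rest on activation maximization: synthesize an input image that strongly excites a chosen internal neuron. The following is a minimal, hypothetical sketch of that idea in PyTorch, assuming the timm library and its pretrained vit_base_patch16_224 checkpoint; the hooked block, the feature index, and the optimizer settings are illustrative choices, not the paper's exact recipe, and the regularization tricks a full pipeline would use are omitted.

```python
# Hypothetical activation-maximization sketch for one ViT neuron.
# Assumes PyTorch and timm; block_idx and feature_idx are arbitrary
# illustrative choices, not values taken from the paper.
import torch
import timm

model = timm.create_model("vit_base_patch16_224", pretrained=True).eval()
for p in model.parameters():
    p.requires_grad_(False)  # only the input image is optimized

activations = {}

def save_activation(module, inputs, output):
    # Output of a ViT MLP sub-block: shape (batch, num_tokens, hidden_dim).
    activations["feat"] = output

block_idx, feature_idx = 6, 100  # arbitrary neuron to visualize
handle = model.blocks[block_idx].mlp.register_forward_hook(save_activation)

# Start from noise and ascend the gradient of the chosen feature's activation.
image = torch.randn(1, 3, 224, 224, requires_grad=True)
optimizer = torch.optim.Adam([image], lr=0.1)

for step in range(200):
    optimizer.zero_grad()
    model(image)
    # Maximize the mean activation of one hidden unit over all patch tokens
    # (token 0 is the CLS token in this architecture, so it is skipped).
    loss = -activations["feat"][0, 1:, feature_idx].mean()
    loss.backward()
    optimizer.step()

handle.remove()
```

In practice, raw gradient ascent like this tends to yield noisy, adversarial-looking images; feature-visualization pipelines usually add jitter, rescaling, and smoothness penalties before the results become interpretable.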

Related research

08/19/2021 · Do Vision Transformers See Like Convolutional Neural Networks?
Convolutional neural networks (CNNs) have so far been the de-facto model...

08/20/2022 · Analyzing Adversarial Robustness of Vision Transformers against Spatial and Spectral Attacks
Vision Transformers have emerged as a powerful architecture that can out...

06/23/2021 · Vision Permutator: A Permutable MLP-Like Architecture for Visual Recognition
In this paper, we present Vision Permutator, a conceptually simple and d...

12/29/2022 · AttEntropy: Segmenting Unknown Objects in Complex Scenes using the Spatial Attention Entropy of Semantic Segmentation Transformers
Vision transformers have emerged as powerful tools for many computer vis...

06/11/2023 · 2-D SSM: A General Spatial Layer for Visual Transformers
A central objective in computer vision is to design models with appropri...

10/14/2022 · Vision Transformer Visualization: What Neurons Tell and How Neurons Behave?
Recently vision transformers (ViT) have been applied successfully for va...

06/10/2022 · Learning to Estimate Shapley Values with Vision Transformers
Transformers have become a default architecture in computer vision, but ...
