Curved Representation Space of Vision Transformers

10/11/2022
by Juyeop Kim, et al.

Neural networks with self-attention (a.k.a. Transformers) such as ViT and Swin have emerged as a better alternative to traditional convolutional neural networks (CNNs) for computer vision tasks. However, our understanding of how the new architecture works is still limited. In this paper, we focus on the phenomenon that Transformers show higher robustness against corruptions than CNNs while not being overconfident; in fact, we find that Transformers are actually underconfident. This is contrary to the intuition that robustness increases with confidence. We resolve this contradiction by investigating how the output of the penultimate layer moves in the representation space as the input data moves within a small area. In particular, we show the following. (1) While CNNs exhibit a fairly linear relationship between input and output movements, Transformers show a nonlinear relationship for some data; for those data, the output of a Transformer moves along a curved trajectory as the input moves linearly. (2) When a data point is located in such a curved region, it is hard to move it out of its decision region, since the output follows the curved trajectory instead of a straight line to the decision boundary; this results in the high robustness of Transformers. (3) If a data point is slightly perturbed so that it jumps out of the curved region, the subsequent movements become linear and the output heads directly to the decision boundary. Thus, Transformers can be attacked easily after a small random jump, and the perturbation in the final attacked data remains imperceptible; in other words, a decision boundary does exist near the data. This also explains the underconfident predictions of Transformers. (4) The curved regions in the representation space start to form at an early training stage and grow throughout training. Some data are trapped in these regions, obstructing Transformers from reducing the training loss.
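The core probing idea, moving an input along a straight line and watching how far the penultimate-layer features deviate from a straight chord, is easy to prototype. Below is a minimal PyTorch sketch of such a probe; the timm model choice, the feature-extraction call, and the chord-deviation curvature score are illustrative assumptions on our part, not the authors' exact protocol.

```python
import torch
import timm  # assumed model zoo; any ViT/CNN pair would do

# Minimal sketch (not the paper's exact procedure): record the
# penultimate-layer representation at evenly spaced points along a
# straight input segment, then score how far the trajectory bends
# away from the straight chord between its endpoints.

@torch.no_grad()
def feature_trajectory(model, x, direction, eps=0.05, steps=16):
    """Penultimate features at points x + t * direction, t in [0, eps]."""
    feats = []
    for t in torch.linspace(0.0, eps, steps):
        # forward_features is timm's pre-head feature call (an assumption);
        # recent timm ViTs return tokens (batch, tokens, dim).
        f = model.forward_features(x + t * direction)
        if f.dim() == 3:
            f = f[:, 0]  # keep the class token
        feats.append(f.flatten(1))
    return torch.stack(feats)  # (steps, batch, dim)

@torch.no_grad()
def curvature_score(feats):
    """Mean distance of intermediate features from the straight chord
    between the first and last feature, normalized by chord length.
    Near zero means a locally linear input-output relationship."""
    start, end = feats[0], feats[-1]
    chord = end - start                                  # (batch, dim)
    chord_len = chord.norm(dim=-1, keepdim=True).clamp_min(1e-12)
    u = chord / chord_len                                # unit chord
    rel = feats - start                                  # (steps, batch, dim)
    proj = (rel * u).sum(-1, keepdim=True) * u           # projection onto chord
    dist = (rel - proj).norm(dim=-1)                     # distance to chord
    return (dist.mean(0) / chord_len.squeeze(-1)).mean().item()

model = timm.create_model("vit_base_patch16_224", pretrained=True).eval()
x = torch.randn(1, 3, 224, 224)  # stand-in for a real, normalized image
d = torch.randn_like(x)
d = d / d.norm()                 # unit direction in input space
print("curvature score:", curvature_score(feature_trajectory(model, x, d)))
```

If finding (1) holds, a CNN such as a ResNet should give scores near zero for most inputs under this probe, while a ViT should show a subset of inputs with noticeably larger scores.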

Related Research

Can CNNs Be More Robust Than Transformers? (06/07/2022)
The recent success of Vision Transformers is shaking the long dominance ...

Towards Better Input Masking for Convolutional Neural Networks (11/26/2022)
The ability to remove features from the input of machine learning models...

Representation Matters: The Game of Chess Poses a Challenge to Vision Transformers (04/28/2023)
While transformers have gained the reputation as the "Swiss army knife o...

Are Transformers More Robust Than CNNs? (11/10/2021)
Transformer emerges as a powerful tool for visual recognition. In additi...

Making Vision Transformers Truly Shift-Equivariant (05/25/2023)
For computer vision tasks, Vision Transformers (ViTs) have become one of...

DataMUX: Data Multiplexing for Neural Networks (02/18/2022)
In this paper, we introduce data multiplexing (DataMUX), a technique tha...

Robustness Verification for Transformers (02/16/2020)
Robustness verification that aims to formally certify the prediction beh...
