Vision Transformers provably learn spatial structure

10/13/2022
by Samy Jelassi et al.

Vision Transformers (ViTs) have achieved performance comparable or superior to Convolutional Neural Networks (CNNs) in computer vision. This empirical breakthrough is all the more remarkable because, in contrast to CNNs, ViTs do not embed any visual inductive bias of spatial locality. Yet recent works have shown that, while minimizing their training loss, ViTs specifically learn spatially localized patterns. This raises a central question: how do ViTs learn these patterns by solely minimizing their training loss with gradient-based methods from random initialization? In this paper, we provide some theoretical justification for this phenomenon. We propose a spatially structured dataset and a simplified ViT model in which the attention matrix depends solely on the positional encodings; we call this mechanism positional attention. On the theoretical side, we consider a binary classification task and show that, although the learning problem admits multiple solutions that generalize, our model implicitly learns the spatial structure of the dataset while generalizing; we call this phenomenon patch association. We prove that patch association enables sample-efficient transfer to downstream datasets that share the spatial structure of the pre-training dataset but differ in its features. Lastly, we empirically verify that a ViT with positional attention performs comparably to the original architecture on CIFAR-10/100, SVHN, and ImageNet.
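
The positional attention mechanism in the simplified model is easy to picture in code. The PyTorch-style sketch below is an illustrative assumption, not the authors' implementation (the names PositionalAttention, num_patches, and dim are made up): queries and keys come from learned positional encodings only, so the attention matrix is identical for every input, while the values still depend on patch content.

```python
# Hedged sketch of positional attention, assuming a PyTorch implementation.
# Not the paper's code: class and argument names are illustrative.
import torch
import torch.nn as nn

class PositionalAttention(nn.Module):
    """Self-attention whose attention matrix depends only on learned
    positional encodings, so it is shared across all inputs; only the
    values are computed from the patch embeddings."""

    def __init__(self, num_patches: int, dim: int):
        super().__init__()
        self.pos = nn.Parameter(torch.randn(num_patches, dim))  # positional encodings
        self.q = nn.Linear(dim, dim, bias=False)
        self.k = nn.Linear(dim, dim, bias=False)
        self.v = nn.Linear(dim, dim, bias=False)
        self.scale = dim ** -0.5

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_patches, dim) patch embeddings
        q = self.q(self.pos)                           # (num_patches, dim)
        k = self.k(self.pos)                           # (num_patches, dim)
        attn = (q @ k.transpose(-1, -2)) * self.scale
        attn = attn.softmax(dim=-1)                    # input-independent attention matrix
        return attn @ self.v(x)                        # (batch, num_patches, dim)

# Usage: the same attention pattern is applied to every image in a batch.
layer = PositionalAttention(num_patches=16, dim=64)
out = layer(torch.randn(8, 16, 64))                    # -> (8, 16, 64)
```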

Related research

09/16/2022
ConvFormer: Closing the Gap Between CNN and Vision Transformers
Vision transformers have shown excellent performance in computer vision ...

08/19/2021
Do Vision Transformers See Like Convolutional Neural Networks?
Convolutional neural networks (CNNs) have so far been the de-facto model...

12/27/2021
Vision Transformer for Small-Size Datasets
Recently, the Vision Transformer (ViT), which applied the transformer st...

01/26/2022
Training Vision Transformers with Only 2040 Images
Vision Transformers (ViTs) are emerging as an alternative to convolutiona...

07/02/2023
X-MLP: A Patch Embedding-Free MLP Architecture for Vision
Convolutional neural networks (CNNs) and vision transformers (ViT) have ...

11/12/2021
Convolutional Nets Versus Vision Transformers for Diabetic Foot Ulcer Classification
This paper compares well-established Convolutional Neural Networks (CNNs...

07/26/2023
Sparse Double Descent in Vision Transformers: real or phantom threat?
Vision transformers (ViT) have been of broad interest in recent theoreti...
