Teaching Matters: Investigating the Role of Supervision in Vision Transformers

by   Matthew Walmer, et al.

Vision Transformers (ViTs) have gained significant popularity in recent years and have proliferated into many applications. However, it is not well explored how varied their behavior is under different learning paradigms. We compare ViTs trained through different methods of supervision, and show that they learn a diverse range of behaviors in terms of their attention, representations, and downstream performance. We also discover ViT behaviors that are consistent across supervision, including the emergence of Offset Local Attention Heads. These are self-attention heads that attend to a token adjacent to the current token with a fixed directional offset, a phenomenon that to the best of our knowledge has not been highlighted in any prior work. Our analysis shows that ViTs are highly flexible and learn to process local and global information in different orders depending on their training method. We find that contrastive self-supervised methods learn features that are competitive with explicitly supervised features, and they can even be superior for part-level tasks. We also find that the representations of reconstruction-based models show non-trivial similarity to contrastive self-supervised models. Finally, we show how the "best" layer for a given task varies by both supervision method and task, further demonstrating the differing order of information processing in ViTs.


page 2

page 4

page 14

page 15

page 17

page 18

page 19

page 20


Generative or Contrastive? Phrase Reconstruction for Better Sentence Representation Learning

Though offering amazing contextualized token-level representations, curr...

RePre: Improving Self-Supervised Vision Transformer with Reconstructive Pre-training

Recently, self-supervised vision transformers have attracted unprecedent...

Spatial Entropy Regularization for Vision Transformers

Recent work has shown that the attention maps of Vision Transformers (VT...

Understanding Self-Attention of Self-Supervised Audio Transformers

Self-supervised Audio Transformers (SAT) enable great success in many do...

Decoupling the Role of Data, Attention, and Losses in Multimodal Transformers

Recently multimodal transformer models have gained popularity because th...

SIRL: Similarity-based Implicit Representation Learning

When robots learn reward functions using high capacity models that take ...

Guiding Attention for Self-Supervised Learning with Transformers

In this paper, we propose a simple and effective technique to allow for ...

Please sign up or login with your details

Forgot password? Click here to reset