Objectives Matter: Understanding the Impact of Self-Supervised Objectives on Vision Transformer Representations

04/25/2023
by Shashank Shekhar, et al.

Joint-embedding learning (e.g., SimCLR, MoCo, DINO) and reconstruction-based learning (e.g., BEiT, SimMIM, MAE) are the two leading paradigms for self-supervised learning of vision transformers, but they differ substantially in their transfer performance. Here, we aim to explain these differences by analyzing the impact of these objectives on the structure and transferability of the learned representations. Our analysis reveals that reconstruction-based features are significantly dissimilar to joint-embedding features, and that models trained with similar objectives learn similar features even across architectures. These differences arise early in the network and are primarily driven by attention and normalization layers. We find that joint-embedding features yield better linear-probe transfer for classification because the two objectives drive different distributions of information and invariances in the learned representation. The same differences explain the opposite trend in transfer performance for downstream tasks that require spatially specific features. Finally, we examine how fine-tuning changes reconstruction-based representations to enable better transfer, showing that it reorganizes the information to be more similar to that of pre-trained joint-embedding models.
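The layerwise feature comparisons summarized above are typically made with a representational similarity metric such as linear centered kernel alignment (CKA). The sketch below is a minimal, illustrative implementation of linear CKA, not the paper's exact analysis pipeline; the placeholder feature matrices are hypothetical stand-ins for activations one would extract (e.g., via forward hooks) from a joint-embedding model such as DINO and a reconstruction model such as MAE on a shared evaluation set.

import torch

def linear_cka(X: torch.Tensor, Y: torch.Tensor) -> float:
    """Linear Centered Kernel Alignment between two feature matrices.

    X: (n, d1) and Y: (n, d2) hold features for the same n inputs,
    e.g. CLS-token embeddings from corresponding layers of two ViTs.
    Returns a similarity score in [0, 1]; higher means more similar.
    """
    X = X - X.mean(dim=0, keepdim=True)  # center each feature dimension
    Y = Y - Y.mean(dim=0, keepdim=True)
    cross = torch.linalg.matrix_norm(X.T @ Y) ** 2  # ||X^T Y||_F^2
    norm_x = torch.linalg.matrix_norm(X.T @ X)      # ||X^T X||_F
    norm_y = torch.linalg.matrix_norm(Y.T @ Y)      # ||Y^T Y||_F
    return (cross / (norm_x * norm_y)).item()

# Hypothetical stand-ins for per-layer ViT activations on a shared
# probe set; in practice these come from model forward passes.
feats_joint_embedding = torch.randn(512, 768)  # e.g. DINO features
feats_reconstruction = torch.randn(512, 768)   # e.g. MAE features
print(f"CKA = {linear_cka(feats_joint_embedding, feats_reconstruction):.3f}")

Computing this score for every pair of layers across models trained with different objectives is one way to surface the pattern the abstract describes, in which the representational differences emerge early in the network.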


Related research

06/10/2023
What Can an Accent Identifier Learn? Probing Phonetic and Prosodic Information in a Wav2vec2-based Accent Identification Model
This study is focused on understanding and quantifying the change in pho...

11/20/2022
Joint Embedding Predictive Architectures Focus on Slow Features
Many common methods for learning a world model for pixel-based environme...

08/29/2023
Exploring Model Transferability through the Lens of Potential Energy
Transfer learning has become crucial in computer vision tasks due to the...

05/23/2023
Can Self-Supervised Neural Representations Pre-Trained on Human Speech distinguish Animal Callers?
Self-supervised learning (SSL) models use only the intrinsic structure o...

07/20/2023
Revisiting Fine-Tuning Strategies for Self-supervised Medical Imaging Analysis
Despite the rapid progress in self-supervised learning (SSL), end-to-end...

10/28/2022
Elastic Weight Consolidation Improves the Robustness of Self-Supervised Learning Methods under Transfer
Self-supervised representation learning (SSL) methods provide an effecti...

05/03/2020
Similarity Analysis of Contextual Word Representation Models
This paper investigates contextual word representation models from the l...
