Objectives Matter: Understanding the Impact of Self-Supervised Objectives on Vision Transformer Representations

04/25/2023
by Shashank Shekhar, et al.

Joint-embedding learning (e.g., SimCLR, MoCo, DINO) and reconstruction-based learning (e.g., BEiT, SimMIM, MAE) are the two leading paradigms for self-supervised learning of vision transformers, but they differ substantially in their transfer performance. Here, we aim to explain these differences by analyzing the impact of these objectives on the structure and transferability of the learned representations. Our analysis reveals that reconstruction-based features are significantly dissimilar to joint-embedding features, and that models trained with similar objectives learn similar features even across architectures. These differences arise early in the network and are primarily driven by attention and normalization layers. We find that joint-embedding features yield better linear-probe transfer for classification because the two objectives drive different distributions of information and invariances in the learned representation. The same differences explain the opposite trend in transfer performance for downstream tasks that require spatially specific features. Finally, we examine how fine-tuning changes reconstruction-based representations to enable better transfer, showing that it reorganizes the information to be more similar to that of pre-trained joint-embedding models.
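The layerwise feature comparisons summarized above are typically made with a representational similarity metric such as linear centered kernel alignment (CKA). The sketch below is a minimal, illustrative implementation of linear CKA, not the paper's exact analysis pipeline; the placeholder feature matrices are hypothetical stand-ins for activations one would extract (e.g., via forward hooks) from a joint-embedding model such as DINO and a reconstruction model such as MAE on a shared evaluation set.

import torch

def linear_cka(X: torch.Tensor, Y: torch.Tensor) -> float:
    """Linear Centered Kernel Alignment between two feature matrices.

    X: (n, d1) and Y: (n, d2) hold features for the same n inputs,
    e.g. CLS-token embeddings from corresponding layers of two ViTs.
    Returns a similarity score in [0, 1]; higher means more similar.
    """
    X = X - X.mean(dim=0, keepdim=True)  # center each feature dimension
    Y = Y - Y.mean(dim=0, keepdim=True)
    cross = torch.linalg.matrix_norm(X.T @ Y) ** 2  # ||X^T Y||_F^2
    norm_x = torch.linalg.matrix_norm(X.T @ X)      # ||X^T X||_F
    norm_y = torch.linalg.matrix_norm(Y.T @ Y)      # ||Y^T Y||_F
    return (cross / (norm_x * norm_y)).item()

# Hypothetical stand-ins for per-layer ViT activations on a shared
# probe set; in practice these come from model forward passes.
feats_joint_embedding = torch.randn(512, 768)  # e.g. DINO features
feats_reconstruction = torch.randn(512, 768)   # e.g. MAE features
print(f"CKA = {linear_cka(feats_joint_embedding, feats_reconstruction):.3f}")

Computing this score for every pair of layers across models trained with different objectives is one way to surface the pattern the abstract describes, in which the representational differences emerge early in the network.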


Related research

06/10/2023
What Can an Accent Identifier Learn? Probing Phonetic and Prosodic Information in a Wav2vec2-based Accent Identification Model
This study is focused on understanding and quantifying the change in pho...

11/20/2022
Joint Embedding Predictive Architectures Focus on Slow Features
Many common methods for learning a world model for pixel-based environme...

08/29/2023
Exploring Model Transferability through the Lens of Potential Energy
Transfer learning has become crucial in computer vision tasks due to the...

05/23/2023
Can Self-Supervised Neural Representations Pre-Trained on Human Speech distinguish Animal Callers?
Self-supervised learning (SSL) models use only the intrinsic structure o...

07/20/2023
Revisiting Fine-Tuning Strategies for Self-supervised Medical Imaging Analysis
Despite the rapid progress in self-supervised learning (SSL), end-to-end...

10/28/2022
Elastic Weight Consolidation Improves the Robustness of Self-Supervised Learning Methods under Transfer
Self-supervised representation learning (SSL) methods provide an effecti...

05/03/2020
Similarity Analysis of Contextual Word Representation Models
This paper investigates contextual word representation models from the l...
