Where are my Neighbors? Exploiting Patches Relations in Self-Supervised Vision Transformer

06/01/2022
by Guglielmo Camporese, et al.

Vision Transformers (ViTs) enabled the use of the transformer architecture on vision tasks, showing impressive performance when trained on large datasets. However, on relatively small datasets, ViTs are less accurate given their lack of inductive biases. To this end, we propose a simple yet effective self-supervised learning (SSL) strategy for training ViTs that, without any external annotation, can significantly improve results. Specifically, we define a set of SSL tasks based on the relations between image patches that the model has to solve either before or jointly with the downstream training. Unlike ViT, our RelViT model optimizes all the output tokens of the transformer encoder that correspond to image patches, thus exploiting more training signal at each training step. We evaluated our proposed method on several image benchmarks, finding that RelViT improves on state-of-the-art SSL methods by a large margin, especially on small datasets.
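
The abstract describes SSL tasks defined over relations between image patches, with supervision applied to every patch-level output token. Below is a minimal, hypothetical PyTorch sketch of one such patch-relation objective. The specific relation taxonomy (eight neighbour directions plus a "not adjacent" class), the head design, and all names (PatchRelationHead, patch_relation_loss, relation_labels) are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class PatchRelationHead(nn.Module):
    """Predicts the spatial relation between two patch tokens
    (eight neighbour directions or 'not adjacent')."""

    def __init__(self, dim: int, num_relations: int = 9):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * dim, dim),
            nn.GELU(),
            nn.Linear(dim, num_relations),
        )

    def forward(self, tok_a: torch.Tensor, tok_b: torch.Tensor) -> torch.Tensor:
        # tok_a, tok_b: (batch, dim) patch tokens -> (batch, num_relations) logits
        return self.mlp(torch.cat([tok_a, tok_b], dim=-1))


def relation_labels(idx_a: torch.Tensor, idx_b: torch.Tensor, grid_w: int) -> torch.Tensor:
    """Map flat patch indices on a grid_w-wide patch grid to a relation class:
    classes 0..7 are the eight neighbour directions, class 8 is 'not adjacent'."""
    ra, ca = idx_a // grid_w, idx_a % grid_w
    rb, cb = idx_b // grid_w, idx_b % grid_w
    dr, dc = rb - ra, cb - ca
    directions = {(-1, -1): 0, (-1, 0): 1, (-1, 1): 2,
                  (0, -1): 3,              (0, 1): 4,
                  (1, -1): 5,  (1, 0): 6,  (1, 1): 7}
    labels = torch.full_like(idx_a, 8)  # default: not adjacent
    for (d_r, d_c), cls in directions.items():
        labels[(dr == d_r) & (dc == d_c)] = cls
    return labels


def patch_relation_loss(head: PatchRelationHead,
                        patch_tokens: torch.Tensor,
                        grid_w: int) -> torch.Tensor:
    """Self-supervised loss: sample one random patch pair per image and
    classify its spatial relation.
    patch_tokens: (batch, num_patches, dim) encoder outputs (class token excluded)."""
    b, n, _ = patch_tokens.shape
    idx_a = torch.randint(0, n, (b,))
    idx_b = torch.randint(0, n, (b,))
    tok_a = patch_tokens[torch.arange(b), idx_a]
    tok_b = patch_tokens[torch.arange(b), idx_b]
    logits = head(tok_a, tok_b)
    labels = relation_labels(idx_a, idx_b, grid_w)
    return F.cross_entropy(logits, labels)


# Toy usage: 196 tokens from a 14x14 patch grid, embedding dim 384.
tokens = torch.randn(8, 196, 384, requires_grad=True)
head = PatchRelationHead(dim=384)
loss = patch_relation_loss(head, tokens, grid_w=14)
loss.backward()
```

In a pretraining or joint-training setting of the kind the abstract mentions, a loss like this would typically be added to (or scheduled before) the downstream objective, so that the patch-level tokens receive a training signal at every step.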

Related research

04/29/2021  Emerging Properties in Self-Supervised Vision Transformers
In this paper, we question if self-supervised learning provides new prop...

06/16/2022  Patch-level Representation Learning for Self-supervised Vision Transformers
Recent self-supervised learning (SSL) methods have shown impressive resu...

05/30/2022  HiViT: Hierarchical Vision Transformer Meets Masked Image Modeling
Recently, masked image modeling (MIM) has offered a new methodology of s...

05/27/2022  Architecture-Agnostic Masked Image Modeling – From ViT back to CNN
Masked image modeling (MIM), an emerging self-supervised pre-training me...

07/27/2020  Representation Learning with Video Deep InfoMax
Self-supervised learning has made unsupervised pretraining relevant agai...

06/10/2022  Position Labels for Self-Supervised Vision Transformer
Position encoding is important for vision transformer (ViT) to capture t...

04/08/2021  HindSight: A Graph-Based Vision Model Architecture For Representing Part-Whole Hierarchies
This paper presents a model architecture for encoding the representation...
