Patch-level Representation Learning for Self-supervised Vision Transformers

06/16/2022
by Sukmin Yun et al.

Recent self-supervised learning (SSL) methods have shown impressive results in learning visual representations from unlabeled images. This paper aims to improve their performance further by utilizing the architectural advantages of the underlying neural network, since current state-of-the-art visual pretext tasks for SSL do not enjoy this benefit, i.e., they are architecture-agnostic. In particular, we focus on Vision Transformers (ViTs), which have recently gained much attention as a better architectural choice, often outperforming convolutional networks on various visual tasks. A unique characteristic of ViTs is that they take a sequence of disjoint patches from an image and process patch-level representations internally. Inspired by this, we design a simple yet effective visual pretext task, coined SelfPatch, for learning better patch-level representations. To be specific, we enforce invariance between each patch and its neighbors, i.e., each patch treats similar neighboring patches as positive samples. Consequently, training ViTs with SelfPatch learns more semantically meaningful relations among patches (without using human-annotated labels), which can be beneficial, in particular, to dense prediction downstream tasks. Despite its simplicity, we demonstrate that SelfPatch can significantly improve the performance of existing SSL methods on various visual tasks, including object detection and semantic segmentation. Specifically, SelfPatch significantly improves the recent self-supervised ViT, DINO, achieving +1.3 AP on COCO object detection, +1.2 AP on COCO instance segmentation, and +2.9 mIoU on ADE20K semantic segmentation.
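As a rough illustration of the neighbor-as-positive idea described above, the following PyTorch-style sketch selects, for each patch, its most similar adjacent patches under a teacher's embedding and pulls the corresponding student patch toward them. This is a simplified sketch rather than the authors' implementation: the function names (neighbor_indices, selfpatch_loss), the choice of k, and the cosine-alignment objective are all assumptions made for illustration.

```python
# Minimal sketch of the neighbor-as-positive idea, assuming DINO-style
# student/teacher patch embeddings of shape (B, N, D). This is an
# illustration, not the authors' implementation; names here are made up.
import torch
import torch.nn.functional as F


def neighbor_indices(h, w):
    """For each of the h*w patches on the grid, return the indices of its
    adjacent patches (up to 8), padded with the patch's own index."""
    idx = torch.arange(h * w).view(h, w)
    rows = []
    for i in range(h):
        for j in range(w):
            block = idx[max(i - 1, 0):i + 2, max(j - 1, 0):j + 2].flatten()
            block = block[block != idx[i, j]]          # drop the patch itself
            pad = idx[i, j].repeat(8 - block.numel())  # pad corners/edges
            rows.append(torch.cat([block, pad]))
    return torch.stack(rows)                           # (h*w, 8)


def selfpatch_loss(student_patches, teacher_patches, grid_hw, k=4):
    """Pull each student patch toward the k adjacent patches that are most
    similar to it under the teacher's embedding (its positive neighbors)."""
    h, w = grid_hw
    nbr = neighbor_indices(h, w).to(student_patches.device)   # (N, 8)
    s = F.normalize(student_patches, dim=-1)                   # (B, N, D)
    t = F.normalize(teacher_patches, dim=-1)                   # (B, N, D)

    t_nbr = t[:, nbr]                                          # (B, N, 8, D)
    sim = torch.einsum('bnd,bnkd->bnk', t, t_nbr)              # (B, N, 8)
    top = sim.topk(k, dim=-1).indices                          # (B, N, k)
    top = top.unsqueeze(-1).expand(-1, -1, -1, t.size(-1))
    pos = torch.gather(t_nbr, 2, top)                          # (B, N, k, D)

    # Cosine-alignment loss: match each student patch to the mean of its
    # selected teacher neighbors (a simplified stand-in for the paper's
    # aggregation and matching objective).
    target = F.normalize(pos.mean(dim=2), dim=-1)
    return -(s * target).sum(dim=-1).mean()
```

In a full training loop, a term like this would be computed on the ViT's patch tokens and added to the image-level SSL objective (e.g., DINO's), with teacher_patches produced by a momentum-updated encoder.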

Related research

08/30/2022
Self-Supervised Pyramid Representation Learning for Multi-Label Visual Analysis and Beyond
While self-supervised learning has been shown to benefit a number of vis...

12/13/2022
OAMixer: Object-aware Mixing Layer for Vision Transformers
Patch-based models, e.g., Vision Transformers (ViTs) and Mixers, have sh...

11/15/2021
iBOT: Image BERT Pre-Training with Online Tokenizer
The success of language Transformers is primarily attributed to the pret...

03/14/2023
OVRL-V2: A simple state-of-art baseline for ImageNav and ObjectNav
We present a single neural network architecture composed of task-agnosti...

02/07/2022
Context Autoencoder for Self-Supervised Representation Learning
We present a novel masked image modeling (MIM) approach, context autoenc...

12/18/2020
Self-supervised Learning with Fully Convolutional Networks
Although deep learning based methods have achieved great success in many...

06/01/2022
Where are my Neighbors? Exploiting Patches Relations in Self-Supervised Vision Transformer
Vision Transformers (ViTs) enabled the use of transformer architecture o...
