Siamese Image Modeling for Self-Supervised Vision Representation Learning

06/02/2022
by Chenxin Tao, et al.

Self-supervised learning (SSL) has delivered superior performance on a variety of downstream vision tasks. Two mainstream SSL frameworks have been proposed, i.e., Instance Discrimination (ID) and Masked Image Modeling (MIM). ID pulls together the representations of different views of the same image while avoiding feature collapse. It performs well on linear probing but is inferior in detection performance. MIM, on the other hand, reconstructs the original content from a masked image. It excels at dense prediction but fails to perform well on linear probing. Their distinctions stem from neglecting the representation requirements of either semantic alignment or spatial sensitivity. Specifically, we observe that (1) semantic alignment demands that semantically similar views be projected into nearby representations, which can be achieved by contrasting different views with strong augmentations; (2) spatial sensitivity requires modeling the local structure within an image, so predicting dense representations from a masked image is beneficial because it models the conditional distribution of image content. Driven by this analysis, we propose Siamese Image Modeling (SIM), which predicts the dense representations of an augmented view based on another masked view from the same image with different augmentations. Our method uses a Siamese network with two branches. The online branch encodes the first view and predicts the second view's representations according to the relative positions between the two views. The target branch produces the targets by encoding the second view. In this way, SIM matches the linear probing performance of ID and the dense prediction performance of MIM. We also demonstrate that decent linear probing results can be obtained without a global loss. Code shall be released.
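The two-branch prediction described above can be illustrated with a minimal numpy sketch. This is not the paper's implementation: the linear "encoders", the toy dimensions, the random views, and the omission of relative-position encoding and EMA updates are all simplifying assumptions; only the overall shape (online branch encodes a masked view and predicts dense features, target branch encodes the other view, loss is per-patch negative cosine similarity) follows the abstract.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy sizes, not from the paper.
NUM_PATCHES, DIM = 16, 8
MASK_RATIO = 0.5

def l2_normalize(x, axis=-1, eps=1e-8):
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

# Toy "encoders": single linear maps standing in for ViT backbones.
W_online = rng.normal(size=(DIM, DIM))
W_target = W_online.copy()            # target branch: copy of online (EMA in practice)
W_pred = rng.normal(size=(DIM, DIM))  # dense predictor head

def online_branch(view1_patches, mask):
    """Encode the visible patches of view 1 and predict dense features for
    every patch location of view 2 (relative-position encoding omitted)."""
    visible = view1_patches * (~mask)[:, None]  # zero out masked patches
    encoded = visible @ W_online
    return encoded @ W_pred                     # predicted dense features

def target_branch(view2_patches):
    """Encode view 2 with the (stop-gradient) target encoder."""
    return view2_patches @ W_target

def dense_loss(pred, target):
    """Mean negative cosine similarity over all patch locations."""
    p, t = l2_normalize(pred), l2_normalize(target)
    return float(-(p * t).sum(axis=-1).mean())

# Two differently augmented "views" of the same image (random stand-ins).
view1 = rng.normal(size=(NUM_PATCHES, DIM))
view2 = view1 + 0.1 * rng.normal(size=(NUM_PATCHES, DIM))
mask = rng.random(NUM_PATCHES) < MASK_RATIO

loss = dense_loss(online_branch(view1, mask), target_branch(view2))
print(f"dense loss: {loss:.4f}")
```

In training, the loss would be backpropagated only through the online branch, with the target branch updated as an exponential moving average of the online weights, mirroring common Siamese SSL setups.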


