Improvements to Self-Supervised Representation Learning for Masked Image Modeling

05/21/2022
by Jiawei Mao, et al.

This paper explores improvements to the masked image modeling (MIM) paradigm. MIM enables a model to learn the main object features of an image by masking part of the input image and predicting the masked patches from the unmasked ones. We identify three main directions in which MIM can be improved. First, although both the encoder and the decoder contribute to representation learning during pre-training, only the encoder is used for downstream tasks, so the decoder's effect on the learned representations is ignored. Although the MIM paradigm already employs a small decoder in an asymmetric structure, we argue that further reducing the decoder's parameters improves the representation learning capability of the encoder. Second, MIM solves the image prediction task by training the encoder and decoder jointly and does not design a separate task for the encoder. To further enhance the encoder's performance on downstream tasks, we design contrastive learning and token position prediction tasks for the encoder. Third, since the input image may contain background and other objects, and the proportion of each object in the image varies, reconstructing tokens that belong to the background or to other objects is not meaningful for MIM's understanding of the main object representations. We therefore use ContrastiveCrop to crop the input image so that it contains, as far as possible, only the main objects. Based on these three improvements to MIM, we propose a new model, Contrastive Masked AutoEncoders (CMAE). With a ViT-B backbone, CMAE achieves 65.84% Top-1 accuracy on Tiny ImageNet, outperforming the competing MAE method by +2.89% when all other conditions are equal. Code will be made available.
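To make the masking-and-reconstruction step concrete, the sketch below shows the MIM paradigm the abstract builds on: random patch masking, an encoder that sees only visible tokens, and a much smaller decoder that reconstructs the masked patches. This is a hedged toy example, not the authors' CMAE implementation; the TinyMIM name, layer counts, dimensions, and 75% mask ratio are illustrative assumptions, and only the encoder/decoder asymmetry mirrors the "small decoder" point made above.

```python
# Minimal sketch of MIM-style masked pre-training (illustrative only).
import torch
import torch.nn as nn

class TinyMIM(nn.Module):
    def __init__(self, num_patches=64, patch_dim=768, enc_dim=768,
                 dec_dim=128, mask_ratio=0.75):
        super().__init__()
        self.mask_ratio = mask_ratio
        # Learnable positional embedding for all patch positions.
        self.pos = nn.Parameter(torch.zeros(1, num_patches, enc_dim))
        self.embed = nn.Linear(patch_dim, enc_dim)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(enc_dim, nhead=8, batch_first=True),
            num_layers=2)
        # Asymmetric design: the decoder is far narrower and shallower
        # than the encoder, in line with the paper's small-decoder argument.
        self.to_dec = nn.Linear(enc_dim, dec_dim)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dec_dim))
        self.decoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dec_dim, nhead=4, batch_first=True),
            num_layers=1)
        self.head = nn.Linear(dec_dim, patch_dim)

    def forward(self, patches):
        B, N, _ = patches.shape
        n_keep = int(N * (1 - self.mask_ratio))
        # Random per-sample shuffle: the first n_keep indices stay visible.
        noise = torch.rand(B, N, device=patches.device)
        ids = noise.argsort(dim=1)
        keep, masked = ids[:, :n_keep], ids[:, n_keep:]
        x = self.embed(patches) + self.pos
        vis = torch.gather(x, 1, keep.unsqueeze(-1).expand(-1, -1, x.size(-1)))
        # The encoder sees only the visible (unmasked) tokens.
        enc = self.to_dec(self.encoder(vis))
        # The decoder sees encoded visible tokens plus mask tokens and
        # predicts pixel patches; the loss covers masked positions only.
        mask_tok = self.mask_token.expand(B, masked.size(1), -1)
        dec = self.decoder(torch.cat([enc, mask_tok], dim=1))
        pred = self.head(dec[:, n_keep:])
        target = torch.gather(
            patches, 1, masked.unsqueeze(-1).expand(-1, -1, patches.size(-1)))
        return ((pred - target) ** 2).mean()

model = TinyMIM()
loss = model(torch.randn(2, 64, 768))  # batch of 2 images, 64 patches each
loss.backward()
```

After pre-training of this kind, only the encoder would be kept for downstream tasks, which is exactly why the abstract argues the decoder should stay as small as possible.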
