Masked Image Modeling with Denoising Contrast

05/19/2022
by Kun Yi, et al.

As self-supervised visual representation learning has shifted from contrastive learning to masked image modeling, the essence of the problem has not changed: the question remains how to design proper pretext tasks for vision dictionary look-up. Masked image modeling currently dominates this line of research, achieving state-of-the-art performance with vision Transformers; its core is to strengthen the network's patch-level visual context capturing via a denoising auto-encoding mechanism. Rather than tailoring image tokenizers with extra training stages as in previous works, we unleash the great potential of contrastive learning on denoising auto-encoding and introduce a new pre-training method, ConMIM, which produces simple intra-image inter-patch contrastive constraints as the learning objectives for masked patch prediction. We further strengthen the denoising mechanism with asymmetric designs, including asymmetric image perturbations and asymmetric model progress rates, to improve the network pre-training. ConMIM-pretrained vision Transformers of various scales achieve promising results on downstream image classification, semantic segmentation, object detection, and instance segmentation tasks.
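
To make the stated objective concrete, the sketch below illustrates an intra-image, inter-patch contrastive loss for masked patch prediction, as described in the abstract. It is a minimal illustration under stated assumptions, not the authors' released implementation: the function name conmim_loss, the temperature value, and the 60% mask ratio are hypothetical.

```python
# Hedged sketch: for each masked patch, the feature predicted from the masked
# image should match the corresponding patch feature of the full image, while
# all other patches of the SAME image serve as negatives (denoising contrast).
import torch
import torch.nn.functional as F

def conmim_loss(student_feats, teacher_feats, mask, temperature=0.1):
    """Intra-image inter-patch contrastive loss for one image.

    student_feats: (N, D) patch features predicted from the masked image.
    teacher_feats: (N, D) patch features from the full (unmasked) image,
                   produced by a separate copy of the network.
    mask:          (N,) boolean tensor, True at masked patch positions.
    """
    # L2-normalize so the dot product is a cosine similarity.
    s = F.normalize(student_feats, dim=-1)
    t = F.normalize(teacher_feats, dim=-1).detach()  # stop-gradient on targets

    # Similarity of every predicted patch to every target patch in the image:
    # the matching position is the positive, all other patches are negatives.
    logits = s @ t.T / temperature                    # (N, N)
    targets = torch.arange(s.size(0), device=s.device)

    # Only masked positions contribute to the denoising objective.
    return F.cross_entropy(logits[mask], targets[mask])


if __name__ == "__main__":
    N, D = 196, 768                                   # 14x14 patches, ViT-Base dim
    student = torch.randn(N, D, requires_grad=True)
    teacher = torch.randn(N, D)
    mask = torch.rand(N) < 0.6                        # illustrative mask ratio
    loss = conmim_loss(student, teacher, mask)
    loss.backward()
    print(float(loss))
```

Around such a loss, the asymmetric designs mentioned in the abstract would apply, e.g. stronger perturbation of the masked view and a more slowly updated network producing the target features; the exact choices are detailed in the paper.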

Related research

- Self-Supervised Pre-training of Vision Transformers for Dense Prediction Tasks (05/30/2022)
- Mask3D: Pre-training 2D Vision Transformers by Learning Masked 3D Priors (02/28/2023)
- Self-Supervised Pyramid Representation Learning for Multi-Label Visual Analysis and Beyond (08/30/2022)
- Contrastive Masked Autoencoders are Stronger Vision Learners (07/27/2022)
- iBOT: Image BERT Pre-Training with Online Tokenizer (11/15/2021)
- Investigating transformers in the decomposition of polygonal shapes as point collections (08/17/2021)
- Representation Separation for Semantic Segmentation with Vision Transformers (12/28/2022)
