Hard Patches Mining for Masked Image Modeling

04/12/2023
by Haochen Wang, et al.

Masked image modeling (MIM) has attracted considerable research attention for its promising potential to learn scalable visual representations. In typical approaches, models focus on predicting the specific contents of masked patches, and their performance depends heavily on pre-defined mask strategies. Intuitively, this procedure can be viewed as training a student (the model) to solve given problems (predicting masked patches). We argue, however, that the model should not only solve the given problems, but also stand in the shoes of a teacher and pose more challenging problems for itself. To this end, we propose Hard Patches Mining (HPM), a new framework for MIM pre-training. We observe that the reconstruction loss naturally serves as a metric of the difficulty of the pre-training task. We therefore introduce an auxiliary loss predictor that first predicts patch-wise losses and then decides where to mask next. It adopts a relative relationship learning strategy to avoid overfitting to exact reconstruction loss values. Experiments under various settings demonstrate the effectiveness of HPM in constructing masked images. Furthermore, we empirically find that introducing the loss prediction objective alone already leads to powerful representations, verifying the value of being aware of which patches are hard to reconstruct.
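
The two mechanisms described above, an auxiliary head trained to rank per-patch reconstruction losses rather than regress their exact values, and a masking step driven by those predictions, can be illustrated with a short sketch. The code below is a minimal, hypothetical PyTorch rendering of these ideas in an MAE-style pipeline; the function names `relative_ranking_loss` and `choose_hard_mask`, the pairwise logistic formulation, and all tensor shapes are assumptions made for illustration, not the authors' implementation.

```python
# Hedged sketch of the two ideas from the abstract, assuming a PyTorch / MAE-style setup:
# (1) a relative (ranking) objective for the auxiliary loss predictor, and
# (2) choosing which patches to mask from the predicted per-patch difficulty.
import torch


def relative_ranking_loss(predicted_losses: torch.Tensor,
                          recon_losses: torch.Tensor) -> torch.Tensor:
    """Pairwise ranking loss: the predictor only has to order patches by difficulty,
    not match the exact reconstruction loss values, which drift during training.

    Both inputs have shape (B, N): one value per patch per image.
    """
    # Differences between all patch pairs, shape (B, N, N).
    pred_diff = predicted_losses.unsqueeze(2) - predicted_losses.unsqueeze(1)
    true_diff = recon_losses.unsqueeze(2) - recon_losses.unsqueeze(1)
    # +1 if patch i is truly harder than patch j, -1 if easier, 0 on the diagonal.
    sign = torch.sign(true_diff).detach()
    # Logistic pairwise loss on the ordering only (scale-free target).
    return torch.nn.functional.softplus(-sign * pred_diff).mean()


def choose_hard_mask(predicted_losses: torch.Tensor,
                     mask_ratio: float = 0.75) -> torch.Tensor:
    """Mask the patches predicted to be hardest to reconstruct.

    Returns a boolean mask of shape (B, N); True marks a patch to be masked.
    """
    B, N = predicted_losses.shape
    num_mask = int(N * mask_ratio)
    hard_idx = predicted_losses.topk(num_mask, dim=1).indices
    mask = torch.zeros(B, N, dtype=torch.bool)
    mask[torch.arange(B).unsqueeze(1), hard_idx] = True
    return mask


if __name__ == "__main__":
    # Toy usage: 2 images, 196 patches each (14x14 grid, 224x224 image, patch size 16).
    pred = torch.randn(2, 196)   # auxiliary predictor output: one score per patch
    recon = torch.rand(2, 196)   # per-patch reconstruction losses from the MIM decoder
    print(relative_ranking_loss(pred, recon).item())
    mask = choose_hard_mask(pred, mask_ratio=0.75)
    print(mask.float().mean().item())  # roughly 0.75 of patches selected for masking
```

Using only the sign of the true loss differences keeps the target scale-free, which is the point of relative relationship learning: the predictor only needs to know which patches are harder, so it is not destabilized as absolute reconstruction losses shrink over training. The sketch greedily masks the top-ranked patches and omits any additional masking schedule the full method may use.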


