MST: Masked Self-Supervised Transformer for Visual Representation

06/10/2021
by Zhaowen Li, et al.

Transformer has been widely used for self-supervised pre-training in Natural Language Processing (NLP) and has achieved great success. However, it has not been fully explored in visual self-supervised learning. Meanwhile, previous methods only consider high-level features and learn representations from a global perspective, which may fail to transfer to downstream dense prediction tasks that focus on local features. In this paper, we present a novel Masked Self-supervised Transformer approach named MST, which can explicitly capture the local context of an image while preserving the global semantic information. Specifically, inspired by Masked Language Modeling (MLM) in NLP, we propose a masked token strategy based on the multi-head self-attention map, which dynamically masks some tokens of local patches without damaging the crucial structure needed for self-supervised learning. More importantly, the masked tokens together with the remaining tokens are further recovered by a global image decoder, which preserves the spatial information of the image and is more friendly to downstream dense prediction tasks. Experiments on multiple datasets demonstrate the effectiveness and generality of the proposed method. For instance, MST achieves a Top-1 accuracy of 76.9% under linear evaluation with only 300-epoch pre-training, outperforming supervised methods trained for the same number of epochs by 0.4% and the comparable variant DINO by 1.0%. For dense prediction tasks, MST also achieves 42.7% mAP on MS COCO object detection and 74.04% mIoU on Cityscapes segmentation with only 100-epoch pre-training.
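To make the attention-guided masking concrete, here is a minimal PyTorch-style sketch of the idea described in the abstract: patch tokens that receive the least [CLS] attention are masked, so tokens carrying the crucial image structure stay visible. All shapes, names, and the mask ratio are illustrative assumptions, not the authors' exact implementation.

```python
# Sketch only: assumes attention from a ViT block with a [CLS] token at index 0.
import torch


def attention_guided_mask(attn: torch.Tensor, mask_ratio: float = 0.3) -> torch.Tensor:
    """Pick patch tokens to mask using the multi-head self-attention map.

    attn: (B, heads, N+1, N+1) self-attention weights, where index 0 along
          the token axes is the [CLS] token.
    Returns a boolean mask of shape (B, N); True marks a masked patch token.
    """
    # Attention that [CLS] pays to each patch token, averaged over heads.
    cls_attn = attn[:, :, 0, 1:].mean(dim=1)        # (B, N)
    B, N = cls_attn.shape
    num_mask = int(N * mask_ratio)
    # Mask the *least* attended patches so the crucial structure is preserved.
    order = cls_attn.argsort(dim=1)                 # ascending attention
    mask = torch.zeros(B, N, dtype=torch.bool, device=attn.device)
    mask.scatter_(1, order[:, :num_mask], True)
    return mask


if __name__ == "__main__":
    B, H, N = 2, 6, 196                             # e.g. ViT with 14x14 patches
    attn = torch.rand(B, H, N + 1, N + 1).softmax(dim=-1)
    mask = attention_guided_mask(attn, mask_ratio=0.3)
    print(mask.shape, mask.sum(dim=1))              # 58 masked tokens per image
```

In the paper's pipeline, the masked positions would then be replaced by a learnable mask embedding and the full token sequence (masked plus visible) passed through the encoder and a global image decoder that reconstructs the pixels, which is what preserves spatial information for dense prediction.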


Related Research

01/18/2022
RePre: Improving Self-Supervised Vision Transformer with Reconstructive Pre-training
Recently, self-supervised vision transformers have attracted unprecedent...

06/06/2023
DenseDINO: Boosting Dense Self-Supervised Learning with Token-Based Point-Level Consistency
In this paper, we propose a simple yet effective transformer framework f...

04/17/2022
On Effectively Learning of Knowledge in Continual Pre-training
Pre-trained language models (PLMs) like BERT have made significant progr...

12/03/2022
Exploring Stochastic Autoregressive Image Modeling for Visual Representation
Autoregressive language modeling (ALM) has been successfully used in se...

02/17/2023
Self-Supervised Representation Learning from Temporal Ordering of Automated Driving Sequences
Self-supervised feature learning enables perception systems to benefit f...

05/08/2023
Self-supervised Pre-training with Masked Shape Prediction for 3D Scene Understanding
Masked signal modeling has greatly advanced self-supervised pre-training...

03/22/2022
Self-supervision through Random Segments with Autoregressive Coding (RandSAC)
Inspired by the success of self-supervised autoregressive representation...
