MILAN: Masked Image Pretraining on Language Assisted Representation

08/11/2022
by   Zejiang Hou, et al.

Self-attention based transformer models have been dominating many computer vision tasks in the past few years. Their superb model quality heavily depends on excessively large labeled image datasets. To reduce the reliance on large labeled datasets, reconstruction-based masked autoencoders are gaining popularity; they learn high-quality transferable representations from unlabeled images. For the same purpose, recent weakly supervised image pretraining methods explore language supervision from text captions accompanying the images. In this work, we propose masked image pretraining on language assisted representation, dubbed MILAN. Instead of predicting raw pixels or low-level features, our pretraining objective is to reconstruct image features that carry substantial semantic signals, obtained using caption supervision. Moreover, to accommodate our reconstruction target, we propose a more efficient prompting decoder architecture and a semantic aware mask sampling mechanism, which further advance the transfer performance of the pretrained model. Experimental results demonstrate that MILAN delivers higher accuracy than previous works. When the masked autoencoder is pretrained and finetuned on the ImageNet-1K dataset with an input resolution of 224x224, MILAN achieves a top-1 accuracy of 85.4%, outperforming previous state-of-the-arts by 1%. In the downstream semantic segmentation task, MILAN achieves 52.7 mIoU using a ViT-B/16 backbone on the ADE20K dataset, outperforming previous masked pretraining results by 4 points.
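To make the pretraining objective concrete, below is a minimal PyTorch sketch (not the authors' released code) of a MILAN-style loss: a ViT encoder processes only the visible patches, a lightweight decoder predicts per-patch features at the masked positions, and the regression target comes from a frozen, caption-supervised image encoder such as a CLIP image tower. Every name here (TinyViTEncoder, PromptingDecoder, milan_style_loss, frozen_target_encoder) is an illustrative assumption; the random masking stands in for MILAN's semantic aware mask sampling, and the single cross-attention layer only approximates the prompting decoder.

```python
# Minimal sketch of a MILAN-style objective, assuming hypothetical module names.
# The regression target is produced by a frozen, caption-supervised encoder
# (standing in for a CLIP image tower), not raw pixels.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TinyViTEncoder(nn.Module):
    """Toy stand-in for the ViT encoder that sees only the visible patches."""
    def __init__(self, patch_dim=768, dim=256, depth=2, heads=4):
        super().__init__()
        self.proj = nn.Linear(patch_dim, dim)
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, depth)

    def forward(self, visible_patches):                 # (B, N_vis, patch_dim)
        return self.blocks(self.proj(visible_patches))  # (B, N_vis, dim)


class PromptingDecoder(nn.Module):
    """Sketch of a lightweight decoder: mask tokens act as queries and attend to
    the visible tokens, which serve as fixed prompts and are not updated here.
    This only approximates MILAN's prompting decoder design."""
    def __init__(self, num_patches, dim=256, target_dim=512, heads=4):
        super().__init__()
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos = nn.Parameter(torch.randn(1, num_patches, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, dim * 4),
                                 nn.GELU(), nn.Linear(dim * 4, dim))
        self.head = nn.Linear(dim, target_dim)

    def forward(self, visible_tokens, vis_idx, mask_idx):
        b, dim = visible_tokens.size(0), visible_tokens.size(-1)
        pick_pos = lambda idx: self.pos.expand(b, -1, -1).gather(
            1, idx.unsqueeze(-1).expand(-1, -1, dim))
        vis = visible_tokens + pick_pos(vis_idx)
        queries = self.mask_token.expand(b, mask_idx.size(1), -1) + pick_pos(mask_idx)
        ctx = torch.cat([vis, queries], dim=1)   # keys/values cover all positions
        out, _ = self.attn(queries, ctx, ctx)    # only mask tokens are updated
        out = out + self.mlp(out)
        return self.head(out)                    # (B, N_mask, target_dim)


def milan_style_loss(patches, frozen_target_encoder, encoder, decoder, mask_ratio=0.75):
    """patches: (B, N, patch_dim) flattened image patches.
    frozen_target_encoder maps patches -> (B, N, target_dim) semantic features;
    in MILAN this role is played by a caption-supervised (CLIP-like) encoder."""
    b, n, _ = patches.shape
    num_masked = int(n * mask_ratio)
    # Uniform random masking for simplicity; MILAN's sampler is semantic aware.
    perm = torch.rand(b, n, device=patches.device).argsort(dim=1)
    vis_idx, mask_idx = perm[:, num_masked:], perm[:, :num_masked]
    gather = lambda x, idx: x.gather(1, idx.unsqueeze(-1).expand(-1, -1, x.size(-1)))

    with torch.no_grad():
        target = frozen_target_encoder(patches)  # language-assisted features
    pred = decoder(encoder(gather(patches, vis_idx)), vis_idx, mask_idx)
    return F.mse_loss(pred, gather(target, mask_idx))


if __name__ == "__main__":
    B, N, P = 2, 196, 768                        # 196 patches of a 224x224 image
    frozen = nn.Linear(P, 512).eval()            # placeholder for a frozen CLIP tower
    enc, dec = TinyViTEncoder(patch_dim=P), PromptingDecoder(num_patches=N)
    loss = milan_style_loss(torch.randn(B, N, P), frozen, enc, dec)
    loss.backward()
    print(float(loss))
```

In this sketch, the frozen target features are what inject caption-derived semantics into the reconstruction objective, while restricting the decoder queries to mask tokens reflects the efficiency motivation behind the prompting decoder.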
