Masked Unsupervised Self-training for Zero-shot Image Classification

06/07/2022
by Junnan Li, et al.

State-of-the-art computer vision models are mostly trained with supervised learning using human-labeled images, which limits their scalability due to the expensive annotation cost. While self-supervised representation learning has achieved impressive progress, it still requires a second stage of finetuning on labeled data. On the other hand, models pre-trained with large-scale text-image supervision (e.g., CLIP) have enabled zero-shot transfer to downstream image classification tasks. However, the zero-shot performance of CLIP-like models is often insufficient for real-world adoption. In this paper, we aim to leverage the abundant unlabeled data to improve the performance of a pre-trained zero-shot classifier on downstream tasks. We propose Masked Unsupervised Self-Training (MUST), a new approach which leverages two different and complementary sources of supervision: pseudo-labels and raw images. MUST jointly optimizes three objectives to learn both class-level global features and pixel-level local features, and enforces a regularization between the two. We demonstrate the efficacy of MUST on 8 downstream tasks across a variety of domains, where it improves upon CLIP by a large margin and narrows the performance gap between unsupervised and supervised classification. For instance, MUST achieves a zero-shot top-1 accuracy of 77.7% on ImageNet using ViT-B, +9.4% higher than CLIP. Code is available at https://github.com/salesforce/MUST.
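The three objectives described in the abstract can be sketched as follows: self-training on confident pseudo-labels (class-level global supervision), masked image modeling (pixel-level local supervision from the raw image itself), and a regularization that aligns local patch features with the global feature. The PyTorch code below is a minimal illustrative sketch, not the official implementation: the tiny encoder, the EMA-teacher setup, the 0.6 confidence threshold, and the 0.5 mask ratio are all assumptions, and details such as augmentation and the EMA weight update are omitted.

```python
# Minimal sketch of MUST's three joint objectives (illustrative only).
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyViT(nn.Module):
    """Stand-in encoder: returns a global feature and per-patch features."""
    def __init__(self, dim=64, patch_pixels=48, num_classes=10):
        super().__init__()
        self.embed = nn.Linear(patch_pixels, dim)   # 48 = 3 * 4 * 4 pixels per patch (hypothetical)
        self.cls_head = nn.Linear(dim, num_classes)
        self.decoder = nn.Linear(dim, patch_pixels) # pixel-level reconstruction head
    def forward(self, patches):
        tokens = self.embed(patches)                # (B, N, dim) per-patch features
        global_feat = tokens.mean(dim=1)            # simple pooling as the global feature
        return global_feat, tokens

def must_step(student, teacher, patches, mask_ratio=0.5, conf_thresh=0.6):
    # 1) Self-training: confident pseudo-labels from the (frozen) EMA teacher
    #    supervise the student's class predictions.
    with torch.no_grad():
        t_global, _ = teacher(patches)
        probs = F.softmax(teacher.cls_head(t_global), dim=-1)
        conf, pseudo = probs.max(dim=-1)
        keep = conf >= conf_thresh                  # keep only confident samples
    s_global, s_tokens = student(patches)
    logits = student.cls_head(s_global)
    loss_st = F.cross_entropy(logits[keep], pseudo[keep]) if keep.any() \
              else logits.new_zeros(())

    # 2) Masked image modeling: reconstruct randomly masked patches from
    #    the raw image (zeroing patches is a crude stand-in for token masking).
    mask = torch.rand(patches.shape[:2]) < mask_ratio   # (B, N) boolean mask
    masked = patches.clone()
    masked[mask] = 0.0
    _, m_tokens = student(masked)
    recon = student.decoder(m_tokens)
    loss_mim = F.mse_loss(recon[mask], patches[mask])

    # 3) Global-local alignment: pull per-patch features toward the
    #    (detached) global feature to regularize the two levels.
    loss_align = (1 - F.cosine_similarity(
        s_tokens, s_global.detach().unsqueeze(1), dim=-1)).mean()

    return loss_st + loss_mim + loss_align

# Toy usage: one optimization step on random "images" split into patches.
student = TinyViT()
teacher = copy.deepcopy(student)        # EMA teacher (weight update omitted)
opt = torch.optim.AdamW(student.parameters(), lr=1e-4)
patches = torch.randn(8, 16, 48)        # batch of 8 images, 16 patches each
loss = must_step(student, teacher, patches)
loss.backward()
opt.step()
```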


Related research

04/07/2022 · Unsupervised Prompt Learning for Vision-Language Models
Contrastive vision-language models like CLIP have shown great progress i...

01/22/2022 · Visual Representation Learning with Self-Supervised Attention for Low-Label High-data Regime
Self-supervision has shown outstanding results for natural language proc...

06/02/2023 · Enhancing CLIP with CLIP: Exploring Pseudolabeling for Limited-Label Prompt Tuning
Fine-tuning vision-language models (VLMs) like CLIP to downstream tasks ...

05/18/2018 · Self-Training Ensemble Networks for Zero-Shot Image Recognition
Despite the advancement of supervised image recognition algorithms, thei...

06/15/2023 · Robustness Analysis on Foundational Segmentation Models
Due to the increase in computational resources and accessibility of data...

04/12/2022 · ReCLIP: A Strong Zero-Shot Baseline for Referring Expression Comprehension
Training a referring expression comprehension (ReC) model for a new visu...

02/13/2023 · A Simple Zero-shot Prompt Weighting Technique to Improve Prompt Ensembling in Text-Image Models
Contrastively trained text-image models have the remarkable ability to p...
