MaskCLIP: Masked Self-Distillation Advances Contrastive Language-Image Pretraining

08/25/2022
by Xiaoyi Dong, et al.

This paper presents a simple yet effective framework, MaskCLIP, which incorporates a newly proposed masked self-distillation objective into contrastive language-image pretraining. The core idea of masked self-distillation is to distill the representation of a full image into the representation predicted from a masked image. This incorporation enjoys two vital benefits. First, masked self-distillation targets local patch representation learning, complementing the vision-language contrastive objective, which focuses on text-related representations. Second, masked self-distillation is also consistent with the vision-language contrastive objective, since both use the visual encoder for feature alignment, and it is thus able to learn local semantics with indirect supervision from the language. We provide specially designed experiments with a comprehensive analysis to validate these two benefits. Empirically, we show that MaskCLIP, when applied to various challenging downstream tasks, achieves superior results in linear probing and finetuning, as well as in zero-shot evaluation guided by the language encoder.
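To make the two objectives concrete, the following is a minimal PyTorch sketch of a MaskCLIP-style training loss. It assumes a ViT-style `visual_encoder` that returns patch tokens with a leading CLS token, an EMA teacher copy of that encoder, and a small prediction `head`; these names, the CLS pooling, and the smooth-L1 distillation loss are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def maskclip_losses(visual_encoder, text_encoder, ema_teacher, head,
                    images, masked_images, texts, tau=0.07):
    """Sketch of the two MaskCLIP objectives: CLIP-style contrastive loss
    plus masked self-distillation (API and pooling are assumptions)."""
    # Global image/text features for the contrastive objective.
    tokens = visual_encoder(images)                      # (B, N, D) patch tokens
    img_feat = F.normalize(tokens[:, 0], dim=-1)         # (B, D) CLS token as global feature
    txt_feat = F.normalize(text_encoder(texts), dim=-1)  # (B, D)

    # Symmetric InfoNCE over the batch, as in CLIP.
    logits = img_feat @ txt_feat.t() / tau               # (B, B) similarity matrix
    labels = torch.arange(logits.size(0), device=logits.device)
    loss_clip = 0.5 * (F.cross_entropy(logits, labels) +
                       F.cross_entropy(logits.t(), labels))

    # Masked self-distillation: the EMA teacher encodes the full image;
    # the student predicts those patch features from the masked view.
    with torch.no_grad():
        target = ema_teacher(images)                     # (B, N, D) full-image features
    pred = head(visual_encoder(masked_images))           # (B, N, D) predicted from masked input
    loss_distill = F.smooth_l1_loss(pred, target)

    return loss_clip + loss_distill
```

After each optimizer step, the teacher's weights would be updated as an exponential moving average of the student encoder's weights, in the usual self-distillation (BYOL/DINO-style) manner.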


research
10/17/2022

Non-Contrastive Learning Meets Language-Image Pre-Training

Contrastive language-image pre-training (CLIP) serves as a de-facto stan...
research
02/25/2021

How to represent part-whole hierarchies in a neural network

This paper does not describe a working system. Instead, it presents a si...
research
09/19/2023

Improving CLIP Robustness with Knowledge Distillation and Self-Training

This paper examines the robustness of a multi-modal computer vision mode...
research
01/15/2022

CLIP-TD: CLIP Targeted Distillation for Vision-Language Tasks

Contrastive language-image pretraining (CLIP) links vision and language ...
research
09/02/2023

Contrastive Feature Masking Open-Vocabulary Vision Transformer

We present Contrastive Feature Masking Vision Transformer (CFM-ViT) - an...
research
02/05/2023

Contrast with Reconstruct: Contrastive 3D Representation Learning Guided by Generative Pretraining

Mainstream 3D representation learning approaches are built upon contrast...
research
07/18/2023

Augmenting CLIP with Improved Visio-Linguistic Reasoning

Image-text contrastive models such as CLIP are useful for a variety of d...
