
Attentive Mask CLIP

by Yifan Yang, et al.
Tongji University
Tsinghua University

Image token removal is an efficient augmentation strategy for reducing the cost of computing image features. However, it has been found to adversely affect the accuracy of CLIP-based training. We hypothesize that removing a large portion of image tokens may improperly discard the semantic content associated with a given text description, thus constituting an incorrect pairing target in CLIP training. To address this issue, we propose an attentive token removal approach for CLIP training, which retains tokens with a high semantic correlation to the text description. The correlation scores are computed in an online fashion using the EMA version of the visual encoder. Our experiments show that the proposed attentive masking approach performs better than the previous method of random token removal for CLIP training. The approach also makes it efficient to apply multiple augmentation views to the image and to introduce instance contrastive learning tasks between these views into the CLIP framework. Compared to other CLIP improvements that combine different pre-training targets, such as SLIP and MaskCLIP, our method is not only more effective but also much more efficient. Specifically, using ViT-B and the YFCC-15M dataset, our approach achieves 43.9% top-1 accuracy on ImageNet-1K zero-shot classification, as well as 62.7/42.1 and 38.0/23.2 I2T/T2I retrieval accuracy on Flickr30K and MS COCO, which are +1.1%, +5.5/+0.9, and +4.4/+1.3 higher than the SLIP method, while being 2.30× faster. An efficient version of our approach, running 1.16× faster than the plain CLIP model, achieves significant gains of +5.3%, +11.3/+8.0, and +9.5/+4.9 on these benchmarks.
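
To make the contrast between random and attentive token removal concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It assumes that a per-token relevance score is already available (the abstract does not say how the EMA visual encoder produces it), and the function names, the keep ratio of 0.5, and the momentum value are all hypothetical choices made for illustration.

```python
import random


def select_attentive_tokens(scores, keep_ratio):
    """Keep the highest-scoring tokens; return their indices in original order.

    `scores` is an assumed per-token relevance score (one float per image token).
    """
    num_keep = max(1, int(len(scores) * keep_ratio))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:num_keep])


def select_random_tokens(num_tokens, keep_ratio, rng):
    """Baseline: keep a uniformly random subset of token indices."""
    num_keep = max(1, int(num_tokens * keep_ratio))
    return sorted(rng.sample(range(num_tokens), num_keep))


def ema_update(ema_params, online_params, momentum=0.99):
    """Generic exponential moving average of parameters (momentum is assumed)."""
    return [momentum * e + (1.0 - momentum) * o
            for e, o in zip(ema_params, online_params)]


# Toy example: six image tokens with made-up relevance scores.
scores = [0.05, 0.40, 0.10, 0.30, 0.02, 0.13]
kept = select_attentive_tokens(scores, keep_ratio=0.5)
print(kept)  # [1, 3, 5]

# The random baseline ignores the scores entirely.
print(select_random_tokens(len(scores), 0.5, random.Random(0)))
```

In this toy case the attentive selection keeps tokens 1, 3, and 5, the three with the highest scores, whereas the random baseline may drop any of them. That is the failure mode the abstract hypothesizes for random removal: the tokens most related to the text can be discarded.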




Beyond Attentive Tokens: Incorporating Token Importance and Diversity for Efficient Vision Transformers

Vision transformers have achieved significant improvements on various vi...

Centroid-centered Modeling for Efficient Vision Transformer Pre-training

Masked Image Modeling (MIM) is a new self-supervised vision pre-training...

Not All Patches are What You Need: Expediting Vision Transformers via Token Reorganizations

Vision Transformers (ViTs) take all the image patches as tokens and cons...

TIER: Text-Image Entropy Regularization for CLIP-style models

In this paper, we study the effect of a novel regularization scheme on c...

VIMPAC: Video Pre-Training via Masked Token Prediction and Contrastive Learning

Video understanding relies on perceiving the global content and modeling...

Random-LTD: Random and Layerwise Token Dropping Brings Efficient Training for Large-scale Transformers

Large-scale transformer models have become the de-facto architectures fo...

Bootstrapped Q-learning with Context Relevant Observation Pruning to Generalize in Text-based Games

We show that Reinforcement Learning (RL) methods for solving Text-Based ...