TIER: Text-Image Entropy Regularization for CLIP-style models

12/13/2022
by Anil Palepu, et al.

In this paper, we study the effect of a novel regularization scheme on contrastive language-image pre-trained (CLIP) models. Our approach is based on the observation that, in many domains, text tokens should only describe a small number of image regions and, likewise, each image region should correspond to only a few text tokens. In CLIP-style models, this implies that text-token embeddings should have high similarity to only a small number of image-patch embeddings for a given image-text pair. We formalize this observation using a novel regularization scheme that penalizes the entropy of the text-token to image-patch similarity scores. We qualitatively and quantitatively demonstrate that the proposed regularization scheme shrinks the text-token and image-patch similarity scores towards zero, thus achieving the desired effect. We demonstrate the promise of our approach in an important medical context where this underlying hypothesis naturally arises. Using our proposed approach, we achieve state-of-the-art (SOTA) zero-shot performance on all tasks from the CheXpert chest X-ray dataset, outperforming an unregularized version of the model and several recently published self-supervised models.
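The abstract describes the penalty only at a high level. The sketch below shows one way such a token-to-patch entropy penalty could be written in Python with PyTorch, assuming L2-normalized token and patch embeddings and a softmax normalization of each token's similarities over image patches; the function name, temperature value, and the lambda_tier weight are illustrative choices, not the paper's exact formulation.

# Illustrative sketch (not the authors' exact formulation): an entropy penalty
# on text-token -> image-patch similarity scores for a CLIP-style model.
import torch
import torch.nn.functional as F

def token_patch_entropy_penalty(text_tokens, image_patches, temperature=0.07, eps=1e-8):
    """
    text_tokens:   (batch, n_tokens, dim)  text-token embeddings
    image_patches: (batch, n_patches, dim) image-patch embeddings

    Returns a scalar penalty: for each text token, the entropy of its
    softmax-normalized similarity distribution over image patches.
    Penalizing this entropy encourages each token to match only a few
    patches, the sparsity effect the abstract describes.
    """
    # Cosine similarities between every text token and every image patch.
    text_tokens = F.normalize(text_tokens, dim=-1)
    image_patches = F.normalize(image_patches, dim=-1)
    sims = torch.einsum("btd,bpd->btp", text_tokens, image_patches) / temperature

    # Treat each token's similarity row as a distribution over patches.
    probs = sims.softmax(dim=-1)                       # (batch, n_tokens, n_patches)
    entropy = -(probs * (probs + eps).log()).sum(-1)   # (batch, n_tokens)
    return entropy.mean()

# Usage (hypothetical): add the penalty to the usual CLIP contrastive loss, e.g.
#   loss = clip_contrastive_loss + lambda_tier * token_patch_entropy_penalty(t, v)
# where lambda_tier is a regularization weight tuned on a validation set.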


Related research

04/13/2023  [CLS] Token is All You Need for Zero-Shot Semantic Segmentation
In this paper, we propose an embarrassingly simple yet highly effective ...

03/15/2023  Zero-Shot Contrastive Loss for Text-Guided Diffusion Image Style Transfer
Diffusion models have shown great promise in text-guided image style tra...

04/28/2021  Zero-Shot Detection via Vision and Language Knowledge Distillation
Zero-shot image classification has made promising progress by training t...

01/19/2022  CM3: A Causal Masked Multimodal Model of the Internet
We introduce CM3, a family of causally masked generative models trained ...

09/07/2023  S-Adapter: Generalizing Vision Transformer for Face Anti-Spoofing with Statistical Tokens
Face Anti-Spoofing (FAS) aims to detect malicious attempts to invade a f...

12/16/2022  Attentive Mask CLIP
Image token removal is an efficient augmentation strategy for reducing t...

04/18/2023  Token Imbalance Adaptation for Radiology Report Generation
Imbalanced token distributions naturally exist in text documents, leadin...
