CLIP also Understands Text: Prompting CLIP for Phrase Understanding

10/11/2022
by An Yan, et al.

Contrastive Language-Image Pretraining (CLIP) efficiently learns visual concepts by pre-training with natural language supervision. CLIP and its visual encoder have been explored on various vision and language tasks and achieve strong zero-shot or transfer learning performance. However, the application of its text encoder solely for text understanding has been less explored. In this paper, we find that the text encoder of CLIP actually demonstrates a strong ability for phrase understanding, and with a properly designed prompt can even significantly outperform popular language models such as BERT. Extensive experiments validate the effectiveness of our method across different datasets and domains on entity clustering and entity set expansion tasks.
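The abstract does not spell out the prompt or the pipeline, but the core recipe it describes, embedding each phrase with CLIP's text encoder through a prompt template and then clustering the embeddings, can be sketched as below. This is a minimal sketch assuming the Hugging Face transformers CLIP API; the checkpoint name, the prompt template "A photo of a {phrase}.", the example phrases, and the cluster count are illustrative placeholders, not the paper's actual configuration.

# Sketch: phrase embeddings from CLIP's text encoder, used for entity clustering.
# Assumes the Hugging Face `transformers` CLIP API; prompt template is hypothetical.
import torch
from sklearn.cluster import KMeans
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

phrases = ["golden retriever", "siamese cat", "boeing 747", "airbus a380"]
prompts = [f"A photo of a {p}." for p in phrases]  # illustrative prompt template

inputs = processor(text=prompts, return_tensors="pt", padding=True)
with torch.no_grad():
    # Projected text embeddings from CLIP's text encoder, shape (batch, 512).
    feats = model.get_text_features(**inputs)

# L2-normalize, then cluster the phrase embeddings (entity clustering).
feats = feats / feats.norm(dim=-1, keepdim=True)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(feats.numpy())
print(dict(zip(phrases, labels)))

In the same spirit, entity set expansion would presumably rank candidate phrases by cosine similarity of their prompted embeddings to the mean embedding of a small seed set; the paper's exact scoring scheme is not given in the abstract.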


Related Research

03/21/2023 · Is BERT Blind? Exploring the Effect of Vision-and-Language Pretraining on Visual Language Understanding
Most humans use visual imagination to understand and reason about langua...

11/25/2022 · ComCLIP: Training-Free Compositional Image and Text Matching
Contrastive Language-Image Pretraining (CLIP) has demonstrated great zer...

06/06/2023 · On the Difference of BERT-style and CLIP-style Text Encoders
Masked language modeling (MLM) has been one of the most popular pretrain...

11/30/2021 · An implementation of the "Guess who?" game using CLIP
CLIP (Contrastive Language-Image Pretraining) is an efficient method for...

03/20/2023 · CLIP goes 3D: Leveraging Prompt Tuning for Language Grounded 3D Recognition
Vision-Language models like CLIP have been widely adopted for various ta...

08/30/2023 · AnoVL: Adapting Vision-Language Models for Unified Zero-shot Anomaly Localization
Contrastive Language-Image Pre-training (CLIP) models have shown promisi...
