Prototypical Contrastive Language Image Pretraining

06/22/2022
by Delong Chen, et al.

Contrastive Language Image Pretraining (CLIP) has received widespread attention since its learned representations transfer well to various downstream tasks. During CLIP training, the InfoNCE objective aims to align positive image-text pairs and separate negative ones. In this paper, we show a representation grouping effect that arises during this process: the InfoNCE objective indirectly groups semantically similar representations together via randomly emerged within-modal anchors. We introduce Prototypical Contrastive Language Image Pretraining (ProtoCLIP) to enhance such grouping by boosting its efficiency and increasing its robustness against the modality gap. Specifically, ProtoCLIP sets up prototype-level discrimination between the image and text spaces, which efficiently transfers higher-level structural knowledge. We further propose Prototypical Back Translation (PBT) to decouple representation grouping from representation alignment, resulting in effective learning of meaningful representations under a large modality gap. PBT also enables us to introduce additional external teachers with richer prior knowledge. ProtoCLIP is trained with an online episodic training strategy, which allows it to be scaled up to unlimited amounts of data. Combining the above designs, we train ProtoCLIP on Conceptual Captions and achieve a +5.81% improvement in ImageNet linear probing and a +2.01% improvement in ImageNet zero-shot classification. Code is available at https://github.com/megvii-research/protoclip.
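The abstract refers to the InfoNCE objective and to prototype-level discrimination without spelling them out. The sketch below is a minimal PyTorch-style illustration, not the ProtoCLIP codebase: the first function is the standard symmetric CLIP/InfoNCE loss over a batch of paired image and text embeddings, and the second shows one possible way to form prototype-level targets (k-means over text embeddings), which is only a stand-in for the paper's actual prototype estimation and Prototypical Back Translation. All names here (info_nce_loss, prototype_discrimination_loss, num_prototypes) are hypothetical.

```python
# Minimal sketch (assumed PyTorch + scikit-learn); not the authors' implementation.
import torch
import torch.nn.functional as F
from sklearn.cluster import KMeans


def info_nce_loss(image_features: torch.Tensor,
                  text_features: torch.Tensor,
                  temperature: float = 0.07) -> torch.Tensor:
    """Symmetric CLIP-style InfoNCE: align matched image-text pairs, push apart the rest.

    Both inputs are (N, D) L2-normalized embeddings of N paired samples.
    """
    logits = image_features @ text_features.t() / temperature     # (N, N) similarities
    targets = torch.arange(logits.size(0), device=logits.device)  # diagonal = positives
    loss_i2t = F.cross_entropy(logits, targets)       # image -> text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)   # text -> image direction
    return 0.5 * (loss_i2t + loss_t2i)


def prototype_discrimination_loss(image_features: torch.Tensor,
                                  text_features: torch.Tensor,
                                  num_prototypes: int = 64,
                                  temperature: float = 0.07) -> torch.Tensor:
    """Illustrative prototype-level target, NOT ProtoCLIP's exact algorithm:
    cluster the text embeddings into prototypes, then train each image to
    predict its paired caption's cluster, transferring grouping structure
    at the prototype level rather than the instance level.

    Assumes the batch/episode contains at least `num_prototypes` samples.
    """
    with torch.no_grad():
        km = KMeans(n_clusters=num_prototypes, n_init=10)
        labels = km.fit_predict(text_features.detach().cpu().numpy())
        assignments = torch.as_tensor(labels, device=image_features.device).long()
        prototypes = F.normalize(
            torch.as_tensor(km.cluster_centers_,
                            dtype=image_features.dtype,
                            device=image_features.device), dim=-1)
    logits = image_features @ prototypes.t() / temperature  # (N, K) image-vs-prototype scores
    return F.cross_entropy(logits, assignments)
```

Under the online episodic training strategy mentioned above, prototype-level targets of this kind would plausibly be re-estimated on each new episode of data rather than once over the full dataset; the full paper describes the exact procedure.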


Related research

05/28/2022 - CyCLIP: Cyclic Contrastive Language-Image Pretraining
Recent advances in contrastive representation learning over paired image...

05/23/2023 - ConGraT: Self-Supervised Contrastive Pretraining for Joint Graph and Text Embeddings
We propose ConGraT (Contrastive Graph-Text pretraining), a general, self-...

05/27/2022 - Multimodal Masked Autoencoders Learn Transferable Representations
Building scalable models to learn from diverse, multimodal data remains ...

07/26/2020 - Contrastive Visual-Linguistic Pretraining
Several multi-modality representation learning approaches such as LXMERT...

10/11/2021 - Supervision Exists Everywhere: A Data Efficient Contrastive Language-Image Pre-training Paradigm
Recently, large-scale Contrastive Language-Image Pre-training (CLIP) has...

02/05/2023 - Contrast with Reconstruct: Contrastive 3D Representation Learning Guided by Generative Pretraining
Mainstream 3D representation learning approaches are built upon contrast...

10/10/2022 - HiCo: Hierarchical Contrastive Learning for Ultrasound Video Model Pretraining
The self-supervised ultrasound (US) video model pretraining can use a sm...
