Improving CLIP Training with Language Rewrites

05/31/2023
by Lijie Fan, et al.

Contrastive Language-Image Pre-training (CLIP) stands as one of the most effective and scalable methods for training transferable vision models using paired image and text data. CLIP models are trained with a contrastive loss, which typically relies on data augmentations to prevent overfitting and shortcut learning. However, in the CLIP training paradigm, data augmentations are applied exclusively to image inputs, while language inputs remain unchanged throughout training, limiting the diversity of texts seen for the same image. In this paper, we introduce Language augmented CLIP (LaCLIP), a simple yet highly effective approach to enhance CLIP training through language rewrites. Leveraging the in-context learning capability of large language models, we rewrite the text descriptions associated with each image. These rewritten texts exhibit diversity in sentence structure and vocabulary while preserving the original key concepts and meanings. During training, LaCLIP randomly selects either the original texts or the rewritten versions as text augmentations for each image. Extensive experiments on the CC3M, CC12M, RedCaps and LAION-400M datasets show that CLIP pre-training with language rewrites significantly improves transfer performance without computation or memory overhead during training. For ImageNet zero-shot accuracy, LaCLIP outperforms CLIP by 8.2% on CC12M and 2.4% on LAION-400M. Code is available at https://github.com/LijieFan/LaCLIP.
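The training-side change described in the abstract is small: for each image, the caption fed to the text encoder is sampled from a pool containing its original caption and its LLM-generated rewrites, while the image pipeline and contrastive loss stay as in standard CLIP. The sketch below illustrates this idea, assuming the rewrites were generated offline; the function names (sample_caption, clip_contrastive_loss, training_step) and the encoder/tokenizer interfaces are illustrative placeholders, not the released LaCLIP API.

```python
# Minimal sketch of LaCLIP-style text augmentation inside a CLIP training step.
# Assumes rewrites for each caption were generated offline with a large language
# model (e.g. via few-shot in-context prompting) and stored with the dataset.
import random

import torch
import torch.nn.functional as F


def sample_caption(original, rewrites):
    """Uniformly pick either the original caption or one of its rewrites."""
    return random.choice([original] + list(rewrites))


def clip_contrastive_loss(image_feats, text_feats, temperature=0.07):
    """Symmetric InfoNCE loss over matched image/text pairs (standard CLIP)."""
    image_feats = F.normalize(image_feats, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)
    logits = image_feats @ text_feats.t() / temperature
    labels = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.t(), labels))


def training_step(images, originals, rewrites, image_encoder, text_encoder, tokenizer):
    """One step: images are augmented as usual; one caption is sampled per image."""
    captions = [sample_caption(o, r) for o, r in zip(originals, rewrites)]
    image_feats = image_encoder(images)             # [B, D] image embeddings
    text_feats = text_encoder(tokenizer(captions))  # [B, D] text embeddings
    return clip_contrastive_loss(image_feats, text_feats)
```

Because the rewrites are precomputed, the only per-step addition is the random caption choice, which is consistent with the abstract's claim of no extra computation or memory during training.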

Related research

04/07/2022 · Unsupervised Prompt Learning for Vision-Language Models
Contrastive vision-language models like CLIP have shown great progress i...

10/18/2022 · MedCLIP: Contrastive Learning from Unpaired Medical Images and Text
Existing vision-text contrastive learning like CLIP aims to match the pa...

11/06/2021 · Tip-Adapter: Training-free CLIP-Adapter for Better Vision-Language Modeling
Contrastive Vision-Language Pre-training, known as CLIP, has provided a ...

05/03/2022 · Data Determines Distributional Robustness in Contrastive Language Image Pre-training (CLIP)
Contrastively trained image-text models such as CLIP, ALIGN, and BASIC h...

07/13/2023 · Bootstrapping Vision-Language Learning with Decoupled Language Pre-training
We present a novel methodology aimed at optimizing the application of fr...

09/19/2023 · Learning Tri-modal Embeddings for Zero-Shot Soundscape Mapping
We focus on the task of soundscape mapping, which involves predicting th...

03/21/2023 · Contrastive Alignment of Vision to Language Through Parameter-Efficient Transfer Learning
Contrastive vision-language models (e.g. CLIP) are typically created by ...
