GOPro: Generate and Optimize Prompts in CLIP using Self-Supervised Learning

08/22/2023
by Mainak Singha, et al.

Large-scale foundation models such as CLIP have demonstrated remarkable success in visual recognition tasks by embedding images in a semantically rich space. Self-supervised learning (SSL) has likewise shown promise in improving visual recognition by learning invariant features. However, combining CLIP with SSL is challenging: the resulting multi-task framework, which blends CLIP's contrastive loss with an SSL loss, suffers from difficulties in loss weighting and from inconsistency among different views of images in CLIP's output space. To overcome these challenges, we propose a prompt learning-based model called GOPro, a unified framework that enforces similarity between various augmented views of input images in a shared image-text embedding space, using a pair of learnable image and text projectors atop CLIP, to promote invariance and generalizability. To learn such prompts automatically, we leverage the visual content and style primitives extracted from pre-trained CLIP and adapt them to the target task. In addition to CLIP's cross-domain contrastive loss, we introduce a visual contrastive loss and a novel prompt consistency loss that account for the different views of the images. GOPro is trained end-to-end on all three loss objectives, combining the strengths of CLIP and SSL in a principled manner. Empirical evaluations demonstrate that GOPro outperforms state-of-the-art prompting techniques by a significant margin on three challenging domain generalization tasks across multiple benchmarks. Our code is available at https://github.com/mainaksingha01/GOPro.
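The abstract does not give the exact loss formulation, so the following is a minimal numerical sketch of how the three objectives could be combined. Everything here is an assumption for illustration: the function names (`info_nce`, `gopro_loss`), the weight vector `lam`, and the use of a simple mean-squared penalty as the consistency term are hypothetical stand-ins, not the paper's actual method.

```python
import numpy as np

def info_nce(a, b, tau=0.07):
    """Symmetric InfoNCE loss between two batches of embeddings.

    Rows of `a` and `b` are paired; each row's positive is the
    same-index row in the other batch, all other rows are negatives.
    """
    a = a / np.linalg.norm(a, axis=1, keepdims=True)  # L2-normalize
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    logits = a @ b.T / tau                # pairwise cosine similarities / temperature
    labels = np.arange(len(a))

    def ce(l):
        # numerically stable softmax cross-entropy with diagonal targets
        l = l - l.max(axis=1, keepdims=True)
        p = np.exp(l) / np.exp(l).sum(axis=1, keepdims=True)
        return -np.log(p[labels, labels]).mean()

    # average both directions (a -> b and b -> a), as in CLIP-style training
    return 0.5 * (ce(logits) + ce(logits.T))

def gopro_loss(img_v1, img_v2, txt, lam=(1.0, 1.0, 1.0)):
    """Hypothetical weighted sum of the three objectives named in the
    abstract: cross-modal contrastive, visual contrastive across two
    augmented views, and a consistency penalty between the views."""
    l_clip = info_nce(img_v1, txt)        # image-text contrastive (CLIP-style)
    l_vis = info_nce(img_v1, img_v2)      # SSL contrastive between augmented views
    l_pc = np.mean((img_v1 - img_v2) ** 2)  # toy stand-in for prompt consistency
    return lam[0] * l_clip + lam[1] * l_vis + lam[2] * l_pc
```

Training the projectors end-to-end would then amount to minimizing `gopro_loss` over batches of paired augmented views and their prompt embeddings; in practice the weights `lam` are exactly the loss-weighting difficulty the abstract says the unified framework is designed to mitigate.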


