StyLIP: Multi-Scale Style-Conditioned Prompt Learning for CLIP-based Domain Generalization

02/18/2023
by   Shirsha Bose, et al.

Large-scale foundation models (e.g., CLIP) have shown promising zero-shot generalization on downstream tasks by leveraging carefully designed language prompts. Despite this success, however, most prompt learning techniques tend to underperform in the presence of domain shift. Our study addresses this problem and, to improve CLIP's generalization ability across domains, proposes StyLIP, a novel approach for Domain Generalization (DG) based on a domain-agnostic prompt learning strategy. In the absence of explicit domain knowledge, we aim to disentangle the visual style and content information extracted from the pre-trained CLIP so that the prompts can be effortlessly adapted to novel domains at inference time. Specifically, we employ a set of style projectors that learn the prompt tokens directly from multi-scale style features, and the generated prompt embeddings are then fused with the multi-scale visual features learned through a content projector. The projectors are trained contrastively while CLIP's vision and text encoders remain frozen. We present extensive experiments in five different DG settings on multiple benchmarks, demonstrating that StyLIP consistently outperforms the relevant state-of-the-art methods.
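The abstract's core idea — projecting per-layer style statistics into prompt tokens and fusing them with content features — can be sketched in PyTorch. This is a minimal illustration under stated assumptions, not the paper's implementation: the layer dimensions, the mean/std choice of style statistics, and the additive fusion of the content projection are all assumptions made for the example; random tensors stand in for the frozen CLIP vision encoder's multi-scale features.

```python
import torch
import torch.nn as nn


class StyleConditionedPrompts(nn.Module):
    """Sketch of style-conditioned prompt learning (hypothetical layout).

    One style projector per encoder scale maps that scale's channel-wise
    style statistics (mean and std, as in AdaIN-style work) to a prompt
    token; a content projector maps the pooled final-scale feature into
    the same embedding space, and the two are fused additively.
    """

    def __init__(self, layer_dims, embed_dim):
        super().__init__()
        # style stats are (mean, std) concatenated -> 2 * channels
        self.style_projectors = nn.ModuleList(
            nn.Linear(2 * d, embed_dim) for d in layer_dims
        )
        self.content_projector = nn.Linear(layer_dims[-1], embed_dim)

    def forward(self, layer_feats):
        # layer_feats: list of (B, C_l, N_l) patch features, one per scale,
        # assumed to come from a frozen vision encoder
        prompts = []
        for proj, feats in zip(self.style_projectors, layer_feats):
            mu = feats.mean(dim=-1)           # channel-wise mean  (B, C_l)
            sigma = feats.std(dim=-1)         # channel-wise std   (B, C_l)
            prompts.append(proj(torch.cat([mu, sigma], dim=-1)))
        # pooled content feature from the deepest scale
        content = self.content_projector(layer_feats[-1].mean(dim=-1))
        # fuse content into every style-derived prompt token
        return torch.stack([p + content for p in prompts], dim=1)  # (B, L, D)


# toy usage: random stand-ins for three scales of frozen encoder features
torch.manual_seed(0)
feats = [torch.randn(4, c, 16) for c in (96, 192, 384)]
model = StyleConditionedPrompts([96, 192, 384], embed_dim=512)
tokens = model(feats)
print(tokens.shape)  # one prompt token per scale: (4, 3, 512)
```

In a full pipeline these tokens would be prepended to the class-name embedding and passed through CLIP's frozen text encoder, with the projectors trained via a contrastive image-text loss, as the abstract describes.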

