Learning to Decompose Visual Features with Latent Textual Prompts

10/09/2022
by Feng Wang, et al.

Recent advances in pre-training vision-language models like CLIP have shown great potential in learning transferable visual representations. Nonetheless, for downstream inference, CLIP-like models suffer from either 1) degraded accuracy and robustness when text descriptions are inaccurate during retrieval-based inference (the challenge for the zero-shot protocol); or 2) breaking the well-established vision-language alignment (the challenge for linear probing). To address these issues, we propose Decomposed Feature Prompting (DeFo). DeFo leverages a flexible number of learnable embeddings as textual input while maintaining the vision-language dual-model architecture, which enables the model to learn decomposed visual features with the help of feature-level textual prompts. We further use an additional linear layer to perform classification, allowing a scalable number of language inputs. Our empirical study shows DeFo's effectiveness in improving vision-language models. For example, DeFo obtains 73.2% test accuracy on ImageNet with a ResNet-50 backbone without tuning any pretrained weights of either the vision or the language encoder, outperforming zero-shot CLIP by a large margin of 15.0% and outperforming the state-of-the-art vision-language prompt tuning method by 7.6%.
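
Below is a minimal PyTorch sketch of the architecture the abstract describes: frozen CLIP-style image and text encoders, a set of learnable latent textual prompts fed to the text encoder, and a linear classifier over the resulting image-prompt similarity scores. The `DummyEncoder` stub and all dimensions (`n_prompts`, `prompt_len`, `token_dim`, `feat_dim`) are illustrative assumptions for this sketch, not the authors' implementation.

```python
import torch
import torch.nn as nn


class DummyEncoder(nn.Module):
    """Stand-in for a frozen CLIP tower (hypothetical): mean-pools a token
    sequence, if given one, and projects into the shared embedding space."""

    def __init__(self, in_dim, feat_dim):
        super().__init__()
        self.proj = nn.Linear(in_dim, feat_dim)

    def forward(self, x):
        if x.dim() == 3:               # (batch, seq_len, in_dim): pool tokens
            x = x.mean(dim=1)
        return self.proj(x)            # (batch, feat_dim)


class DeFoSketch(nn.Module):
    def __init__(self, image_encoder, text_encoder,
                 n_prompts=64, prompt_len=16, token_dim=512, n_classes=1000):
        super().__init__()
        self.image_encoder = image_encoder
        self.text_encoder = text_encoder
        # Freeze both pretrained towers; only the modules created below train.
        for p in self.parameters():
            p.requires_grad_(False)
        # Latent textual prompts: n_prompts sequences of prompt_len learnable
        # token embeddings, replacing hand-written class descriptions.
        self.prompts = nn.Parameter(0.02 * torch.randn(n_prompts, prompt_len, token_dim))
        # Linear head mapping n_prompts similarity scores to class logits, so
        # the number of prompts is decoupled from the number of classes.
        self.classifier = nn.Linear(n_prompts, n_classes)

    def forward(self, images):
        img = self.image_encoder(images)             # (batch, feat_dim)
        txt = self.text_encoder(self.prompts)        # (n_prompts, feat_dim)
        img = img / img.norm(dim=-1, keepdim=True)   # cosine-normalize both sides
        txt = txt / txt.norm(dim=-1, keepdim=True)
        sims = img @ txt.t()                         # (batch, n_prompts)
        return self.classifier(sims)                 # (batch, n_classes) logits


# Usage with pre-extracted 768-d image features (hypothetical dimensions):
image_enc = DummyEncoder(in_dim=768, feat_dim=512)
text_enc = DummyEncoder(in_dim=512, feat_dim=512)
model = DeFoSketch(image_enc, text_enc)
logits = model(torch.randn(4, 768))                  # -> shape (4, 1000)
```

Note the design point this sketch illustrates: because classification goes through a linear layer over similarity scores rather than a fixed per-class text prompt, the number of textual prompts can be chosen freely, which is what allows the scalable language input the abstract mentions.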


