ProbVLM: Probabilistic Adapter for Frozen Vision-Language Models

07/01/2023
by   Uddeshya Upadhyay, et al.

Large-scale vision-language models (VLMs) like CLIP successfully find correspondences between images and text. Through the standard deterministic mapping process, an image or a text sample is mapped to a single vector in the embedding space. This is problematic because multiple samples (images or texts) can abstract the same concept in the physical world, so deterministic embeddings do not reflect the inherent ambiguity in the embedding space. We propose ProbVLM, a probabilistic adapter that estimates probability distributions for the embeddings of pre-trained VLMs via inter/intra-modal alignment in a post-hoc manner, without requiring large-scale datasets or compute. On four challenging datasets, i.e., COCO, Flickr, CUB, and Oxford-flowers, we estimate the multi-modal embedding uncertainties for two VLMs, i.e., CLIP and BLIP, quantify the calibration of the embedding uncertainties in retrieval tasks, and show that ProbVLM outperforms other methods. Furthermore, we propose active learning and model selection as two real-world downstream tasks for VLMs and show that the estimated uncertainty aids both tasks. Lastly, we present a novel technique for visualizing the embedding distributions using a large-scale pre-trained latent diffusion model.
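The idea of a post-hoc probabilistic adapter can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state which distribution family or loss ProbVLM uses, so a diagonal Gaussian and a negative log-likelihood objective are assumed here. The names `GaussianAdapter`, `gaussian_nll`, and `uncertainty` are hypothetical, and the two hard-coded vectors stand in for embeddings that a frozen encoder such as CLIP or BLIP would produce.

```python
import math
import random


def linear(weights, bias, x):
    """Apply a dense layer: weights is a list of rows, x is a list of floats."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


class GaussianAdapter:
    """Hypothetical adapter: maps a frozen embedding to (mean, log-variance)."""

    def __init__(self, dim, seed=0):
        rng = random.Random(seed)

        def small_matrix():
            return [[rng.gauss(0.0, 0.01) for _ in range(dim)] for _ in range(dim)]

        # The mean head starts near the identity so the frozen embedding is preserved.
        self.w_mu = [
            [(1.0 if i == j else 0.0) + w for j, w in enumerate(row)]
            for i, row in enumerate(small_matrix())
        ]
        self.b_mu = [0.0] * dim
        # The log-variance head starts near zero, i.e. variance close to 1.
        self.w_logvar = small_matrix()
        self.b_logvar = [0.0] * dim

    def __call__(self, z):
        mu = linear(self.w_mu, self.b_mu, z)
        logvar = linear(self.w_logvar, self.b_logvar, z)
        return mu, logvar


def gaussian_nll(mu, logvar, target):
    """Negative log-likelihood of target under a diagonal Gaussian N(mu, exp(logvar))."""
    total = 0.0
    for m, lv, t in zip(mu, logvar, target):
        total += 0.5 * (math.log(2.0 * math.pi) + lv + (t - m) ** 2 / math.exp(lv))
    return total


def uncertainty(logvar):
    """Scalar uncertainty score: mean predicted variance across dimensions."""
    return sum(math.exp(lv) for lv in logvar) / len(logvar)


# Stand-ins for frozen encoder outputs (assumed values, not real CLIP/BLIP embeddings).
image_embedding = [0.2, -0.1, 0.4, 0.3]
text_embedding = [0.25, -0.05, 0.35, 0.3]

adapter = GaussianAdapter(dim=4)
mu, logvar = adapter(image_embedding)

# Intra-modal term: the predicted distribution should explain the image embedding itself.
intra_loss = gaussian_nll(mu, logvar, image_embedding)
# Cross-modal term: it should also explain the paired text embedding.
cross_loss = gaussian_nll(mu, logvar, text_embedding)

print(intra_loss, cross_loss, uncertainty(logvar))
```

In this sketch the encoders are never updated; only the adapter's weights would be trained, by minimizing the two loss terms over paired image-text data. The predicted variance then serves as a per-sample uncertainty estimate that could be used, for example, to rank samples for active learning.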


research
01/13/2021

Probabilistic Embeddings for Cross-Modal Retrieval

Cross-modal retrieval methods build a common representation space for sa...
research
08/22/2023

Ceci n'est pas une pomme: Adversarial Illusions in Multi-Modal Embeddings

Multi-modal encoders map images, sounds, texts, videos, etc. into a sing...
research
07/26/2023

Plug and Pray: Exploiting off-the-shelf components of Multi-Modal Models

The rapid growth and increasing popularity of incorporating additional m...
research
06/04/2023

Sen2Pro: A Probabilistic Perspective to Sentence Embedding from Pre-trained Language Model

Sentence embedding is one of the most fundamental tasks in Natural Langu...
research
02/08/2023

Diagnosing and Rectifying Vision Models using Language

Recent multi-modal contrastive learning models have demonstrated the abi...
research
07/10/2023

SITTA: A Semantic Image-Text Alignment for Image Captioning

Textual and semantic comprehension of images is essential for generating...
research
04/17/2021

Robust Embeddings Via Distributions

Despite recent monumental advances in the field, many Natural Language P...
