VisualGPTScore: Visio-Linguistic Reasoning with Multimodal Generative Pre-Training Scores

06/02/2023
by Zhiqiu Lin, et al.

Vision-language models (VLMs) discriminatively pre-trained with contrastive image-text matching losses such as P(match|text, image) have been criticized for lacking compositional understanding: they can assign similar scores to a caption and to a rearranged version of it with a different meaning. To address this, we propose the Visual Generative Pre-Training Score (VisualGPTScore), P(text|image), a multimodal generative score that captures the likelihood of a text caption conditioned on an image using an image-conditioned language model. Contrary to the belief that VLMs are mere bag-of-words models, our off-the-shelf VisualGPTScore achieves top-tier performance on recently proposed image-text retrieval benchmarks that assess compositional reasoning, such as ARO and CREPE. Furthermore, we factorize VisualGPTScore into a product of the marginal P(text) and the pointwise mutual information (PMI). This helps to (a) diagnose datasets with strong language bias, and (b) debias results on other benchmarks like Winoground using an information-theoretic framework. VisualGPTScore provides valuable insights and serves as a strong baseline for future evaluation of visio-linguistic compositionality.
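
As a rough illustration of how such a generative score could be computed in practice, the sketch below uses an off-the-shelf BLIP captioning model from HuggingFace transformers. The checkpoint name, the length-normalized scoring, the Monte Carlo estimate of the caption prior P(text) from a set of reference images, and the alpha debiasing weight are all assumptions made for this sketch, not the paper's exact setup.

```python
# A minimal sketch (not the authors' exact implementation): score captions with
# an image-conditioned language model, P(text | image), and optionally debias
# by subtracting an estimate of the caption prior, log P(text).
import math

import torch
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

# Assumed checkpoint; the paper's generative VLM may differ.
MODEL_NAME = "Salesforce/blip-image-captioning-base"
processor = BlipProcessor.from_pretrained(MODEL_NAME)
model = BlipForConditionalGeneration.from_pretrained(MODEL_NAME).eval()


@torch.no_grad()
def log_p_text_given_image(image: Image.Image, caption: str) -> float:
    """Length-normalized log P(caption | image) under the captioning model."""
    inputs = processor(images=image, text=caption, return_tensors="pt")
    # Feeding the caption tokens as labels yields the mean per-token
    # cross-entropy, i.e. the average negative log-likelihood.
    out = model(**inputs, labels=inputs["input_ids"])
    return -out.loss.item()


@torch.no_grad()
def log_p_text(caption: str, reference_images: list) -> float:
    """Rough Monte Carlo estimate of log P(caption): log-mean-exp of
    log P(caption | image) over sampled reference images (an assumed
    approximation of the caption prior, not the paper's estimator)."""
    log_probs = [log_p_text_given_image(img, caption) for img in reference_images]
    return torch.logsumexp(torch.tensor(log_probs), dim=0).item() - math.log(len(log_probs))


def debiased_score(image, caption, reference_images, alpha: float = 1.0) -> float:
    """log P(text | image) - alpha * log P(text); with alpha = 1 this is the
    pointwise mutual information between the caption and the image."""
    return log_p_text_given_image(image, caption) - alpha * log_p_text(caption, reference_images)


if __name__ == "__main__":
    image = Image.open("example.jpg").convert("RGB")  # hypothetical input file
    captions = [
        "a dog chasing a cat",
        "a cat chasing a dog",  # same bag of words, different meaning
    ]
    scores = {c: log_p_text_given_image(image, c) for c in captions}
    print(scores)  # the caption that matches the image should score higher
```

Because the two candidate captions contain the same words, a pure bag-of-words scorer cannot separate them, whereas the generative score depends on token order and should prefer the caption whose structure matches the image.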
